The Difference Between Fault and Failure - yuuk1's Digital Garden

Quoted from Martin Kleppmann's book [[データ指向アプリケーションデザイン]] (2019), the Japanese edition of *Designing Data-Intensive Applications*.

## Reliability

> ... then we can understand reliability as meaning, roughly, “continuing to work correctly, even when things go wrong.”

## Fault

> The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.

The Japanese edition renders "fault" as フォールト, "fault-tolerant" as 耐障害性を持つ (フォールトトレラント), and "resilient" as レジリエント.

## Failure

> Note that a fault is not the same as a failure [2]. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.

The Japanese edition renders "failure" as 障害.

Related: [[2004__TDSC__Basic Concepts and Taxonomy of Dependable and Secure Computing]]

---

[[IEEE Standard Classification for Software Anomalies]]

> A failure may be caused by (and thus indicate the presence of) a fault. A fault may cause one or more failures.

> defect: An imperfection or deficiency in a work product where that work product does not meet its requirements or specifications and needs to be either repaired or replaced. (adapted from the Project Management Institute [B19])

> error: A human action that produces an incorrect result. (adapted from ISO/IEC 24765:2009 [B17])

> failure: (A) Termination of the ability of a product to perform a required function or its inability to perform within previously specified limits. (adapted from ISO/IEC 25000:2005 [B18]) (B) An event in which a system or system component does not perform a required function within specified limits. (adapted from ISO/IEC 24765:2009 [B17])

> fault: A manifestation of an error in software. (adapted from ISO/IEC 24765:2009 [B17])

---

[[2024__CSUR__A Survey on Failure Analysis and Fault Injection in AI Systems]]

> We adopt the definitions of failures and faults proposed by previous work [7, 163]. Furthermore, we provide additional extensions and interpretations specific to AI systems.
>
> - Failure is defined as "an incident that occurs when the delivered service deviates from the correct service" [7]. In the context of AI systems, failures can manifest in various ways. For example, a failure can occur when AI services become unreachable, and when the behavior of AI services does not meet the expected outcome (e.g., generating semantically incorrect text). These failures indicate a deviation from the desired or expected behavior of the AI system.
> - Fault is the root cause of a failure. In AI systems, faults can be attributed to various sources, including algorithmic flaws, model design issues, or problems with the quality of the data used for training or inference. It is important to note that faults in AI systems may remain uncovered for some time, due to fault-tolerant approaches implemented in the system.

In the Japanese translation of this passage, "failure" is rendered as 障害 and "fault" as 故障.

[7]\: [[2004__TDSC__Basic Concepts and Taxonomy of Dependable and Secure Computing]]

[163]\: [[2010__CSUR__A Survey of Online Failure Prediction Methods]]

---

![[Faults, Errors, Symptoms, and Failures - Salfner+, CSUR2010]]

---

![[インシデントのライフサイクル - Wu+, ICSE-SEIP2023]]
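
The distinction in the Kleppmann quote (a fault is one component deviating from its spec, a failure is the system as a whole no longer providing the required service) can be illustrated with a minimal Python sketch. Every name here (`ComponentFault`, `ServiceFailure`, `Replica`, `ReplicatedService`) is hypothetical and invented for illustration; none comes from the quoted sources. It is a toy model of fault tolerance through redundancy, not a real replication scheme.

```python
class ComponentFault(Exception):
    """A single component deviated from its spec (a fault)."""


class ServiceFailure(Exception):
    """The system as a whole could not provide the required service (a failure)."""


class Replica:
    """Hypothetical storage replica that may or may not be healthy."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def read(self, key):
        if not self.healthy:
            # The fault: this one component does not meet its spec.
            raise ComponentFault(f"{self.name} is not responding")
        return f"value-of-{key}"


class ReplicatedService:
    """Tolerates replica faults by trying the next replica in order."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.faults_seen = []  # faults encountered while serving requests

    def read(self, key):
        for replica in self.replicas:
            try:
                return replica.read(key)
            except ComponentFault as fault:
                # Fault tolerated: record it and fall through to the next replica.
                self.faults_seen.append(str(fault))
        # Every component faulted, so the fault surfaces to the user as a failure.
        raise ServiceFailure(f"no replica could serve {key!r}")


if __name__ == "__main__":
    service = ReplicatedService([Replica("r1", healthy=False), Replica("r2")])
    print(service.read("user:1"))  # served by r2 despite the fault in r1
    print(service.faults_seen)     # the fault exists but was masked
```

In this sketch a fault in `r1` stays invisible to the caller as long as `r2` is healthy, which mirrors the survey's remark that faults may remain uncovered for some time because of fault-tolerant mechanisms. Only when every replica faults does the caller observe a `ServiceFailure`.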