[[2010__CSUR__A Survey of Online Failure Prediction Methods]]
> Several attempts have been made to get to a precise definition of faults, errors, and failures, among which are Melliar-Smith and Randell [1977]; Avižienis and Laprie [1986]; Laprie and Kanoun [1996]; IEC: International Technical Commission [2002]; Siewiorek and Swarz [1998], page 22; and most recently Avižienis et al. [2004]. Since the latter seems to have broad acceptance, its definitions are used in this article, with some additional extensions and interpretations.
> - A failure is defined as “an event that occurs when the delivered service deviates from correct service.” The main point here is that a failure refers to misbehavior that can be observed by the user, which can either be a human or another computer system. Things may go wrong inside the system, but as long as the problem does not result in incorrect output (including the case where there is no output at all) there is no failure.
> - The situation when “things go wrong” in the system can be formalized as the situation when the system’s state deviates from the correct state, which is called an error. Hence, “an error is the part of the total state of the system that may lead to its subsequent service failure.”
> - Finally, faults are the adjudged or hypothesized cause of an error: the root cause of an error. In most cases, faults remain dormant for some time and, once they become active, they cause an incorrect system state, which is an error. That is why errors are also called manifestations of faults. Several classifications of faults have been proposed in the literature, among which the distinction between transient, intermittent, and permanent faults [Siewiorek and Swarz 1998, page 22] is best known.
> - The definition of an error implies that the activation of a fault leads to an incorrect state; however, this does not necessarily mean that the system knows about it. In addition to the definitions given by Avižienis et al. [2004], we distinguish between undetected errors and detected errors: an error remains undetected until an error detector identifies the incorrect state.
> - Besides causing a failure, undetected or detected errors may cause out-of-norm behavior of system parameters as a side effect. We call this out-of-norm behavior a symptom. In the context of software aging, symptoms are similar to aging-related errors, as implicitly introduced in Grottke and Trivedi [2007] and explicitly named in Grottke et al. [2008].
![[Pasted image 20241025063453.png]]
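
A toy sketch of the definitions quoted above may help keep the terms apart. Everything in it is hypothetical and not taken from the paper: the `ToyCounter` class, its planted bug, the latency threshold, and the `correct_count` oracle (which a real system usually does not have) are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyCounter:
    """Hypothetical counter service with a dormant fault in its update path."""

    count: int = 0           # actual internal state
    correct_count: int = 0   # reference state; an oracle assumed for illustration
    detected: bool = False   # set once an error detector has flagged the state
    latency_ms: float = 1.0  # monitored parameter used to illustrate a symptom

    def increment(self, step: int) -> None:
        self.correct_count += step
        if step > 100:
            # Fault activation: the deliberately planted bug drops large steps,
            # so the actual state now deviates from the correct state (an error).
            # The latency jump stands in for an out-of-norm side effect.
            self.latency_ms *= 5
        else:
            self.count += step

    def has_error(self) -> bool:
        """Error: the internal state deviates from the correct state."""
        return self.count != self.correct_count

    def run_detector(self) -> bool:
        """An undetected error becomes a detected error once a detector runs."""
        if self.has_error():
            self.detected = True
        return self.detected

    def has_symptom(self, limit_ms: float = 3.0) -> bool:
        """Symptom: a monitored parameter is outside its assumed normal range."""
        return self.latency_ms > limit_ms

    def is_positive(self) -> bool:
        """Delivered service: only reports whether the counter is positive."""
        return self.count > 0

    def has_failed(self) -> bool:
        """Failure: the delivered service deviates from the correct service."""
        return self.is_positive() != (self.correct_count > 0)


svc = ToyCounter()
svc.increment(5)     # fault stays dormant, state is correct
svc.increment(500)   # fault is activated, state becomes erroneous
print(svc.has_error(), svc.detected)  # error exists but is still undetected
print(svc.has_symptom())              # out-of-norm latency is observable
print(svc.has_failed())               # output is still correct, so no failure
print(svc.run_detector())             # the detector turns it into a detected error
```

In this sketch the same fault leads to a failure only when the erroneous state reaches the output: a fresh `ToyCounter` whose first call is `increment(500)` reports a non-positive counter although the correct answer is positive.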