[[BIRCH]]を多変量時系列データの異常検知に使用するケースがある。
[[2022__CSUR__Anomaly Detection and Failure Root Cause Analysis in (Micro)Service-Based Cloud Applications - A Survey|Soldani, CSUR2022]]で参照されている[[2018__CLOUD__Detecting Anomalous Behavior of Black-Box Services Modeled with Distance-Based Online Clustering|Gulenko+, CLOUD2018]]の説明を以下に引用する。
> 特に、Gulenkoら[27]は、異なるサービスに対して異なるベースラインモデルを構築することによ り、マルチサービスアプリケーションにおける性能異常を検出することを可能にしている。ある時点で サービス上で監視された KPI はベクトルとしてモデル化され、BIRCH [94] オンラインクラスタリング アルゴリズムが初期学習段階で使用され、監視された KPI に対応するベクトルがクラスタリングされる 。学習フェーズの終わりに、得られたクラスタのセントロイドとセントロイド半径は、監視されたKPIの ベクトルが「正常」として分類されるために関連するはずのベースライン部分空間を決定する。これにより、対応するベクトルがベースライン部分空間のいずれかに該当するかどうか、すなわち、対応する セントロイド半径よりも監視されたKPIのベクトルからセントロイドが離れていないクラスタが存在するかどうかをチェックすることで、各サービスの新しく監視されたKPIのオンライン処理が可能になる。そうでない場合、対応するサービスは性能異常の影響を受けるとみなされる。
[Using BIRCH for anomaly detection with QuestDB | QuestDB](https://questdb.io/blog/2022/08/22/using-birch-anomaly-detection-questdb/)
- 教師なし学習アルゴリズムの一つで、各グループが類似のデータポイントを含むようにデータをグループ化するために使用されます。データセット全体を通過する回数を最小限にし、処理を効率化するため、大規模なデータセットに適しています。
- BIRCHは、異常検知の分野でも有効です。これは、ある規範から逸脱し、ある要件を満たさず、本質的に望ましくないデータポイントを特定するものです。多くのデータポイントがある場合、異常検知を利用して目立つデータポイントを特定するのです。不良品の特定、産業機器の予知保全、サイバー侵入や不正の検知などに有効です。
## ChatGPTが生成したコード
[[2018__CLOUD__Detecting Anomalous Behavior of Black-Box Services Modeled with Distance-Based Online Clustering|Gulenko+, CLOUD2018]]の論文のテキストをプロンプトに含めて、ChatGPTが生成したコード。
# me
Can you code the below description in Python?
We consider a component anomalous when its behavior deviates from normal behavior. The behavior of a component C attimetisdefinedasavectorMt,C ofnmetricsxi ∈R, where i = 1,2,...,n. Behavior metrics for C can also be values x ∈/ R, e.g. categorical variables, in which case a mapping function is required to project such values into R. The normal behavior of C is modeled through a function normalC : Rn → {0, 1}, which returns 1, if the input vector describes normal behavior of C, and 0 otherwise.
The normal behavior model normalC can be realized through a number of binary classification models. However, only few models fulfill the following requirements of an anomaly detection component for a self-healing cloud in- frastructure:
1) It should work online and with limited memory.
2) It should have a low detection latency.
3) It should have minimal interference with the moni-
tored system.
4) It should not depend on training data from anomalous
behavior.
To meet these requirements, we propose to realize normalC by dividing the metric space Rn into two distinct sub spaces for normal and abnormal behavior. The subspace for normal behavior of C is described by a set NC of centroids (c,r), where c ∈ Rn is the center and r ∈ R is the radius of the centroid. The subspace for the abnormal
behavior is the metric space not occupied by any centroid,
n[i.e. R \ NC. Therefore, the normal behavior model is defined as:
$\operatorname{normal}_C(M)= \begin{cases}1 & \text { iff } \exists(c, r) \in \mathcal{N}_C \mid\|M-c\| \leq r \\ 0 & \text { otherwise }\end{cases}$
In other words, normalC returns a positive result, iff any of the centroids contains the incoming sample.
The problem is now reduced to creating the centroid set NC. For that purpose, a number of normal behavior measurements Bnorm must be collected from the observed C component C. Note that no anomaly data is necessary. The normal behavior data can be obtained by simply observing the component for a certain period of time, following the principle that each component behaves normally the majority of time. In a system with frequent anomalous behavior, an administrator can assert the correct operation for a given observation period.
We construct the centroid set NC through the first phase of the online clustering algorithm BIRCH[1]. In that phase, BIRCH creates a micro-cluster approximation of a continuous input stream of vector data. The complete BIRCH algorithm consists of three additional phases: condensing the micro-cluster representation to a smaller memory footprint, an additional global clustering step using the k-means algorithm, and an optional offline refinement step of the resulting clusters. We omit those additional phases and use the micro-clusters produced in the first phase as the centroid set NC describing the normal operation of component C.
BIRCH operates iteratively and does not store the raw input data, which fulfills the previously stated requirements. The micro-clusters are maintained as cluster features (CFs), where each CF summarizes a subset of previously observed input vectors P by storing the number of vectors, the sum, and the squared sum of the vectors. In other words, a CF
summarizing a vector set P is defined as the following tuple:
$\left(|P|, \sum_{x \in P} x, \sum_{x \in P} x^T x\right)$
The stored data can be updated iteratively, and the center and radius of the summarized centroid can be computed from it:
$
\begin{gathered}
c_P=\frac{\sum_{x \in P} x}{|P|} \\
r_P=\sqrt{\frac{\left|\sum_{x \in P} x^T x\right|}{|P|}-\left(\frac{\left|\sum_{x \in P} x\right|}{|P|}\right)^2}
\end{gathered}
$
Therefore, the centroid set NC necessary to compute normalC can be obtained at any time. To optimize the insertion of new input data, the CFs are maintained in a height-balanced tree.
Besides meeting the previously stated requirements, BIRCH also allows online refinements of the normal operation model, as additional data vectors can be inserted into the CF tree at any time. This allows the model to adapt to changed system behavior, like changed workloads. Model updates can be triggered externally, either automatically when the system workload is changed, or manually by an administrator.
# ai
The task you've described is quite complex and will require multiple classes and functions to be implemented. Here is a starting point in Python based on your specifications:
```python
import numpy as np
from scipy.spatial import cKDTree
class ClusterFeature:
def __init__(self):
self.n_vectors = 0
self.vector_sum = None
self.squared_vector_sum = None
def update(self, vector):
if self.vector_sum is None:
self.vector_sum = vector
self.squared_vector_sum = vector ** 2
else:
self.vector_sum += vector
self.squared_vector_sum += vector ** 2
self.n_vectors += 1
def centroid(self):
return self.vector_sum / self.n_vectors
def radius(self):
return np.sqrt(self.squared_vector_sum / self.n_vectors - np.linalg.norm(self.vector_sum / self.n_vectors) ** 2)
class BIRCH:
def __init__(self):
self.cluster_features = []
def insert(self, vector):
cf = ClusterFeature()
cf.update(vector)
self.cluster_features.append(cf)
def get_centroids(self):
return [cf.centroid() for cf in self.cluster_features]
def get_radii(self):
return [cf.radius() for cf in self.cluster_features]
class AnomalyDetection:
def __init__(self):
self.birch = BIRCH()
def train(self, data):
for vector in data:
self.birch.insert(vector)
self.tree = cKDTree(self.birch.get_centroids())
self.radii = np.array(self.birch.get_radii())
def is_normal(self, vector):
distance, index = self.tree.query(vector)
return distance <= self.radii[index]
```
Here's a basic rundown of the code:
- `ClusterFeature` is a class representing the cluster features (CFs) in BIRCH. It stores the number of vectors, the sum, and the squared sum of the vectors, and provides methods to update these values and calculate the centroid and radius.
- `BIRCH` is a class representing the BIRCH algorithm. It maintains a list of `ClusterFeature` objects and provides a method to insert a new vector into the appropriate CF. It also provides methods to retrieve all centroids and radii.
- `AnomalyDetection` is a class that uses the BIRCH algorithm for anomaly detection. It provides a method `train` to train the model on normal behavior data, and a method `is_normal` to determine whether a given vector describes normal behavior.
Please note that this is a simplified implementation. The actual BIRCH algorithm includes more sophisticated features like a height-balanced tree for maintaining the CFs and methods for splitting and merging CFs. The `train` and `is_normal` methods assume that the vectors are numpy arrays, and the `train` method assumes that the data is a list of numpy arrays.
Also note that this implementation doesn't handle categorical variables, as you mentioned in your description. For categorical variables, you would need to define a mapping function that projects the categorical values into real numbers.
# me
Can you rewrite the code with sklearn.cluster.Birch ?
# ai
Sure, here is an example of how you can use the Birch implementation in sklearn for the task. In this example, I'll assume that your data vectors are numpy arrays and that you're using Euclidean distance (the default for Birch) as your distance metric.
```python
import numpy as np
from sklearn.cluster import Birch
class AnomalyDetection:
def __init__(self, threshold=0.5, branching_factor=50):
self.birch = Birch(threshold=threshold, branching_factor=branching_factor, n_clusters=None)
def train(self, data):
self.birch.fit(data)
self.centroids_ = self.birch.subcluster_centers_
self.radii_ = self.calculate_radii()
def calculate_radii(self):
radii = []
for subcluster in self.birch.subcluster_labels_:
mask = self.birch.labels_ == subcluster
cluster_points = self.birch.root_.data_[mask]
radii.append(self.calculate_radius(cluster_points))
return np.array(radii)
def calculate_radius(self, points):
centroid = points.mean(axis=0)
return np.sqrt(((points - centroid) ** 2).sum(axis=1).mean())
def is_normal(self, vector):
distances = np.linalg.norm(self.centroids_ - vector, axis=1)
return np.any(distances <= self.radii_)
```
This code creates an instance of the `Birch` class from `sklearn.cluster`, trains it on your data, and uses it to determine whether a given vector describes normal behavior. The method `calculate_radii` calculates the radius of each subcluster, and the method `is_normal` checks whether the given vector is within the radius of any subcluster. The radius of a subcluster is calculated as the root mean square of the distances from the points in the subcluster to the centroid of the subcluster.
This implementation assumes that you're using Birch's default Euclidean distance metric, and that a vector is considered normal if it's within the radius of any subcluster. If you're using a different distance metric, or if you have a different definition of normal behavior, you'll need to adjust the code accordingly.
Please note that this code uses the root mean square as the radius of a subcluster, which may not be the best choice for all data sets. You could also consider using the maximum distance, the mean distance, or some other measure of the dispersion of the points in the subcluster.