DBSCANのεパラメータの決め方 - k-dist

[[DBSCAN]]がはじめて提案された論文である[[1996__KDDI__A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise|Ester+, KDD1996]]の”4.2 Determining the Parameters Eps and MinPts”で述べられているヒューリスティクスに半径εを決める方法。 > ある点pとそのk番目の最近傍との距離をdとすると、pのd-近傍はほぼ全ての点pに対してちょうどk+l個の点を含む。さらに、クラスタ内のある点のkを変えても、dが大きく変化することはない。これは、k=1,2,3・・・.k番目の最近傍が、クラスタ内のある点に対して一般に真でない直線上にほぼ位置している場合にのみ起こることである。 > 与えられたkに対して、データベースDから実数への関数k-distを定義し、各点をそのk番目の最近傍からの距離に対応させる。データベースの点をk-distの値の降順に並べると、この関数のグラフはデータベース内の密度分布に関するヒントを与えてくれる。このグラフをソートされたk-distグラフと呼ぶ。任意の点pを選び、パラメータEpsをk-dist(p)に設定し、パラメータMinPtsをkとすると、k-dist値が等しいか小さい点はすべてコア点となる。もし、Dの「最下層」クラスタでk-dist値が最大となる閾値点を見つけることができれば、望ましいパラメータ値を持つことになる。閾値は、ソートされたk-distグラフの最初の「谷」の最初の点である(図4参照)。k-dist値が高い点(閾値の左側)はすべてノイズとみなされ、それ以外の点(閾値の右側)は何らかのクラスタに割り当てられる。 > > 一般に、最初の "谷 "を自動的に検出することは非常に困難であるが、ユーザがこの谷をグラフィカルな表現で見ることは比較的簡単である。そこで、我々は閾値を決定するための対話的なアプローチを提案する。DBSCANは EpsとMinPtsの2つのパラメータを必要とする。しかし、我々の実験では、k > 4のk-distグラフは4-distグラフと有意な差がなく、さらにかなり多くの計算を必要とすることが示された。そこで、全てのデータベース(2次元データ)に対してパラメータMinPtsを4に設定することで、これを排除する。DBSCANのパラメータEpsを決定するために、以下の対話的なアプローチを提案する。 - データベースの4-distグラフを計算し、表示するシステムです。 - ノイズの割合が推定できる場合は、その割合を入力し、そこから閾値の案を導き出す。 - ユーザは提案された閾値を受け入れるか、別の点を閾値として選択します。閾値の4-dist値は、DBSCANのEps値として使用されます。 ![[Pasted image 20230505200534.png|400]] --- ChatGPTによる実装 The k-distance plot is a technique used to determine an appropriate epsilon value for DBSCAN. It involves calculating the k-th nearest neighbor distances for all data points and plotting the distances in ascending order. A good epsilon value is usually where there's a noticeable "elbow" or steep increase in the plot. Here's a Python implementation using the `scikit-learn` library to calculate the k-distance plot: ```python import numpy as np import matplotlib.pyplot as plt from sklearn.datasets import make_blobs from sklearn.neighbors import NearestNeighbors # Generate some sample data data, _ = make_blobs(n_samples=300, centers=4, random_state=42) # Set the number of nearest neighbors (k) k = 4 # Calculate the k-nearest neighbors for all data points knn = NearestNeighbors(n_neighbors=k) knn.fit(data) distances, _ = knn.kneighbors(data) # Sort the distances and get the k-th nearest distance for each point k_distances = np.sort(distances[:, -1]) # Plot the k-distance plot plt.plot(range(len(k_distances)), k_distances) plt.xlabel('Data Points (sorted by distance)') plt.ylabel(f'{k}-th Nearest Neighbor Distance') plt.title(f'{k}-distance Plot') plt.show() ``` In this example, we generate some sample data with 4 clusters and set k to 4. We then calculate the k-nearest neighbors for all data points using the `NearestNeighbors` class from `scikit-learn`. The k-th nearest neighbor distances are sorted and plotted. To determine an appropriate epsilon value from the k-distance plot, look for a point where there's a noticeable "elbow" or steep increase in the plot. This elbow point represents a suitable epsilon value, as it marks the transition between dense regions (clusters) and sparser regions (noise or between clusters).