Bachelor Thesis: Anomaly Detection with BIRCH Clustering
In this thesis, we implement and evaluate an unsupervised anomaly detection algorithm that uses an ensemble of clusterings produced by the BIRCH clustering algorithm. Each clustering is built from a different ordering of the input data to overcome the shortcomings of a single BIRCH run. We compute an outlier score for each data point from the sizes of the clusters to which the individual BIRCH clusterings assign it. The underlying assumption is that normal data points end up in larger clusters than outliers, since outliers are by definition rare and different from the majority of the data. We evaluate how different input parameter values affect the detection quality of this method, using both synthetic and realistic data sets. Labels that indicate whether a data point is normal or an anomaly are available, but they are used only to assess detection quality, not for training. The metrics are precision at n, average precision, and area under the ROC curve. We found that an ensemble of BIRCH clusterings performs better than a single BIRCH clustering. Furthermore, we compare the BIRCH ensembles to the anomaly detection algorithms One-Class Support Vector Machine and isolation forest in terms of training time and metric results. The results show that the BIRCH ensemble method achieves metric results that are as good as or better than those of the other algorithms. Moreover, for some data sets the BIRCH ensemble is faster than the One-Class SVM while also giving better metric results. Additionally, we introduce two metrics that assess the performance of a BIRCH ensemble configuration without using the data labels. Although these label-free metrics do not coincide with the label-based metrics, they restrict the range of values for the threshold parameter: among ensembles in which every run of the BIRCH algorithm uses the same threshold, those whose threshold lies within this range perform better.
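The ensemble idea described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: it assumes scikit-learn's `Birch` estimator, and the function name `birch_ensemble_scores`, its default parameters, and the scoring rule (one minus the mean relative cluster size) are hypothetical choices made for illustration only. The abstract does not specify how the score is calculated from the cluster sizes.

```python
import numpy as np
from sklearn.cluster import Birch


def birch_ensemble_scores(X, n_runs=10, threshold=0.5, seed=0):
    """Return one outlier score per row of X (higher = more anomalous).

    Hypothetical helper: the scoring rule below is an assumption,
    not the formula used in the thesis.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    scores = np.zeros(n)
    for _ in range(n_runs):
        # Each ensemble member sees the data in a different order.
        order = rng.permutation(n)
        # n_clusters=None skips the global clustering step, so labels
        # refer directly to the subclusters found by BIRCH.
        model = Birch(threshold=threshold, n_clusters=None)
        labels = model.fit(X[order]).labels_
        # Number of points assigned to the same subcluster as each point.
        sizes = np.bincount(labels)[labels]
        # Assumed rule: a small relative cluster size gives a high score.
        # Indexing with `order` maps the values back to the original rows.
        scores[order] += 1.0 - sizes / n
    return scores / n_runs
```

Given held-out labels, the returned scores could be evaluated with `sklearn.metrics.roc_auc_score` or `sklearn.metrics.average_precision_score`. As in the thesis, the labels are not needed to compute the scores themselves.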
Supervisor: Boris Lorbeer 
Type: Bachelor Thesis
Duration: 4 months