Master Thesis: Parallel Self-Tuning Spectral Clustering on Apache Spark
Parallel Self-Tuning Spectral Clustering on Apache Spark
This thesis proposes a new implementation of the self-tuning
spectral clustering algorithm and a solution to use it on large
datasets by parallelizing the computation. The algorithm studied has
• It does not use an explicit model of data distribution (e.g. Gaussian) to find clusters of Observations.
• The clusters in a dataset do not need to have the same density in order to be found by the Algorithm.
• The algorithm does not require an input specifying the number of clusters in the dataset as it is self-tuned but only the minimum and maximum possible number of groups in it.
After describing the algorithm in details, an implementation developed in Scala is proposed. Compared to the two existing implementations, it is the first one to strictly follow the algorithm on steps such as the selection of the most probable number of clusters in the dataset.
Some steps of the algorithm are updated to make the computation faster. The resulting algorithm is usable as a library by other programs and different graphical user interfaces using it are presented.
The evaluation of the algorithm shows that the new implementation works as well as the one made for the original paper. The computation time of the algorithm is then evaluated for bigger datasets and shows that the algorithm is not usable in this configuration.
A solution to compute datasets containing a high number of small clusters is thus created. Using a k-d tree data structure, this thesis introduces a solution to cut datasets into tiles containing the same number of observations and process them in parallel using Apache Spark.
The parallel solution is evaluated and proved efficient, a dataset clustered in 5 hours by the original algorithm being clustered in 42 seconds. However, this solution only works on datasets that contain small clusters.
A solution for this problem, the tile border, is presented but future work could be done to make the parallelization usable on more various dataset types. Lastly, the computation time of the k-d tree and parallel computation is evaluated on datasets containing up to 10000 clusters.
Supervisor: Ana Kosareva, Boris Lorbeer
Type: Master Thesis
Duration: 6 months
10587 Berlin, Germany
Phone: +49 30 8353 58811
Fax: +49 30 8353 58409
e-mail query