Fast and explainable clustering in the Manhattan and Tanimoto distance
Stefan Güttel, Kaustubh Roy

TL;DR
The paper extends the CLASSIX clustering algorithm to the Manhattan and Tanimoto distances. On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto runs about 30 times faster than Taylor-Butina and about 80 times faster than DBSCAN while producing higher-quality clusters.
Contribution
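It introduces a generalized version of CLASSIX that sorts the data points by an appropriate norm and uses the triangle inequality to terminate the search for nearby points, which makes the method applicable to several distance metrics. For the Tanimoto distance, it uses a provably sharper intersection inequality.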
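To illustrate how a size-based inequality can rule out pairs under the Tanimoto distance, here is a minimal sketch. It uses the standard cardinality bound for binary fingerprints, not necessarily the paper's intersection inequality, which this summary does not state. The function names `tanimoto_distance`, `size_lower_bound`, and `can_skip` are hypothetical, and fingerprints are modelled as sets of "on" bit positions.

```python
def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """Tanimoto distance 1 - |a & b| / |a | b| between two bit sets."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention assumed here: two empty fingerprints coincide
    return 1.0 - len(a & b) / union


def size_lower_bound(size_a: int, size_b: int) -> float:
    """Lower bound on the Tanimoto distance from the set sizes alone.

    Since |a & b| <= min(|a|, |b|) and |a | b| >= max(|a|, |b|),
    the similarity is at most min/max, so the distance is at least
    1 - min/max.
    """
    lo, hi = min(size_a, size_b), max(size_a, size_b)
    return 0.0 if hi == 0 else 1.0 - lo / hi


def can_skip(size_a: int, size_b: int, radius: float) -> bool:
    """True if the pair cannot lie within `radius`, judged by sizes only."""
    return size_lower_bound(size_a, size_b) > radius


a = frozenset({1, 2, 3, 4})
b = frozenset({3, 4, 5})
print(tanimoto_distance(a, b), size_lower_bound(len(a), len(b)))
```

With fingerprints sorted by the number of set bits, the bound grows monotonically along the sorted order, so a scan can stop at the first candidate for which `can_skip` is true.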
Findings
On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto is about 30 times faster than the Taylor-Butina algorithm.
On the same benchmark, CLASSIX Tanimoto is about 80 times faster than DBSCAN.
The new algorithm produces higher-quality clusters.
Abstract
The CLASSIX algorithm is a fast and explainable approach to data clustering. In its original form, this algorithm exploits the sorting of the data points by their first principal component to truncate the search for nearby data points, with nearness being defined in terms of the Euclidean distance. Here we extend CLASSIX to other distance metrics, including the Manhattan distance and the Tanimoto distance. Instead of principal components, we use an appropriate norm of the data vectors as the sorting criterion, combined with the triangle inequality for search termination. In the case of Tanimoto distance, a provably sharper intersection inequality is used to further boost the performance of the new algorithm. On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto is about 30 times faster than the Taylor-Butina algorithm, and about 80 times faster than DBSCAN, while computing…
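The norm-sorting idea from the abstract can be sketched for the Manhattan distance as follows. This is only an illustration of the neighbor-search step, not the CLASSIX implementation: the function name `neighbors_within_radius` is hypothetical, and the aggregation and merging stages of the algorithm are omitted. The early termination relies on the reverse triangle inequality, | ||x||_1 - ||y||_1 | <= ||x - y||_1.

```python
import numpy as np


def neighbors_within_radius(X, radius):
    """Map each point index to the sorted indices of all other points
    within Manhattan distance `radius`.

    Points are sorted by their 1-norm. Once the norm gap between the
    current point and a candidate exceeds `radius`, the reverse triangle
    inequality guarantees that this candidate and all later ones are
    too far away, so the inner loop stops.
    """
    X = np.asarray(X, dtype=float)
    norms = np.abs(X).sum(axis=1)
    order = np.argsort(norms, kind="stable")
    Xs, ns = X[order], norms[order]
    n = len(Xs)
    result = {int(i): [] for i in order}
    for i in range(n):
        for j in range(i + 1, n):
            if ns[j] - ns[i] > radius:
                break  # no later point can be within the radius
            if np.abs(Xs[i] - Xs[j]).sum() <= radius:
                result[int(order[i])].append(int(order[j]))
                result[int(order[j])].append(int(order[i]))
    return {k: sorted(v) for k, v in result.items()}


points = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 3.0], [3.2, 3.1]])
print(neighbors_within_radius(points, radius=1.0))
```

The sort costs O(n log n), and the pruning pays off most when the norms are well spread relative to the radius; if all points have nearly the same norm, the loop degrades towards a full pairwise scan.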
Taxonomy
Topics: Advanced Clustering Algorithms Research · Medical Image Segmentation Techniques · Data Mining Algorithms and Applications
