Fast and explainable clustering in the Manhattan and Tanimoto distance

Stefan Güttel; Kaustubh Roy

arXiv:2601.08781·cs.LG·January 14, 2026

Open Access

TL;DR

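The paper extends the CLASSIX clustering algorithm to the Manhattan and Tanimoto distances. On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto runs about 30 times faster than the Taylor-Butina algorithm and about 80 times faster than DBSCAN, while producing higher-quality clusters.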

Contribution

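It generalizes CLASSIX to multiple distance metrics by sorting the data points by an appropriate norm of the data vectors, rather than by their first principal component, and by using the triangle inequality to terminate the search for nearby points. For the Tanimoto distance, a provably sharper intersection inequality replaces the triangle inequality.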
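The norm-sorting idea can be illustrated with a minimal sketch for the Manhattan distance. This is not the authors' implementation: the function name `neighbors_within_radius` and the fixed-radius neighbor search are assumptions made for illustration only. The sketch relies on the reverse triangle inequality, | ‖x‖₁ − ‖y‖₁ | ≤ ‖x − y‖₁, so once the points are sorted by 1-norm, the scan from a given point can stop as soon as the norm gap exceeds the radius.

```python
import numpy as np

def neighbors_within_radius(X, radius):
    """For each row of X, list the indices of the other rows whose
    Manhattan distance to it is at most `radius` (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    norms = np.abs(X).sum(axis=1)             # 1-norm of every point
    order = np.argsort(norms, kind="stable")  # sort points by their norm
    Xs, ns = X[order], norms[order]
    n = len(Xs)
    result = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Reverse triangle inequality: the distance is at least the
            # norm gap, and the gap only grows with j, so stop scanning.
            if ns[j] - ns[i] > radius:
                break
            if np.abs(Xs[i] - Xs[j]).sum() <= radius:
                a, b = int(order[i]), int(order[j])
                result[a].append(b)
                result[b].append(a)
    return [sorted(r) for r in result]

# Small usage example: the last point is far from the other three.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(neighbors_within_radius(X, 1.0))
```

The early `break` is what truncates the search: no distance is computed for points whose norm differs from the query's by more than the radius.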

Findings

01

On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto is about 30 times faster than the Taylor-Butina algorithm.

02

On the same benchmark, CLASSIX Tanimoto is about 80 times faster than DBSCAN.

03

The new algorithm produces higher-quality clusters.

Abstract

The CLASSIX algorithm is a fast and explainable approach to data clustering. In its original form, this algorithm exploits the sorting of the data points by their first principal component to truncate the search for nearby data points, with nearness being defined in terms of the Euclidean distance. Here we extend CLASSIX to other distance metrics, including the Manhattan distance and the Tanimoto distance. Instead of principal components, we use an appropriate norm of the data vectors as the sorting criterion, combined with the triangle inequality for search termination. In the case of Tanimoto distance, a provably sharper intersection inequality is used to further boost the performance of the new algorithm. On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto is about 30 times faster than the Taylor-Butina algorithm, and about 80 times faster than DBSCAN, while computing…
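For binary fingerprints, a search bound based on set intersections can be sketched as follows. The paper's own inequality is not reproduced in this listing, so the sketch substitutes the classical popcount bound: for fingerprints with a ≤ b set bits, the intersection has at most a bits, so the Tanimoto similarity is at most a/b and the distance is at least 1 − a/b. The function name `tanimoto_neighbors`, the fixed-radius search, and the convention that two empty fingerprints have distance 0 are all assumptions made for illustration.

```python
import numpy as np

def tanimoto_neighbors(F, radius):
    """For each binary fingerprint (row of F), list the indices of the other
    rows within Tanimoto distance `radius` (hypothetical helper)."""
    F = np.asarray(F, dtype=bool)
    counts = F.sum(axis=1)                     # number of set bits per row
    order = np.argsort(counts, kind="stable")  # sort by bit count
    Fs, cs = F[order], counts[order]
    n = len(Fs)
    result = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # cs[j] >= cs[i], so similarity <= cs[i] / cs[j]. The bound
            # 1 - cs[i] / cs[j] only grows with j, so the scan can stop.
            if cs[j] > 0 and 1.0 - cs[i] / cs[j] > radius:
                break
            inter = np.logical_and(Fs[i], Fs[j]).sum()
            union = cs[i] + cs[j] - inter
            # Assumed convention: two empty fingerprints have distance 0.
            dist = 0.0 if union == 0 else 1.0 - inter / union
            if dist <= radius:
                a, b = int(order[i]), int(order[j])
                result[a].append(b)
                result[b].append(a)
    return [sorted(r) for r in result]

# Small usage example with four 4-bit fingerprints.
F = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 0, 1],
              [1, 1, 1, 1]], dtype=bool)
print(tanimoto_neighbors(F, 0.4))
```

As in the norm-sorted search, the benefit comes from skipping distance computations: fingerprints whose bit counts are too different from the query's are never compared bit by bit.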

Taxonomy

Topics: Advanced Clustering Algorithms Research · Medical Image Segmentation Techniques · Data Mining Algorithms and Applications