Clustering by compression
Rudi Cilibrasi (CWI), Paul Vitanyi (CWI, University of Amsterdam)

TL;DR
This paper introduces a universal, compression-based clustering method that works across diverse domains without domain-specific features, using normalized compression distance and hierarchical clustering.
Contribution
It proposes a new universal similarity measure based on compression, along with a hierarchical clustering approach and a quartet method for extracting clusters, applicable across many fields.
Findings
Successfully applied to genomics, linguistics, music, and more
Demonstrated robustness across different compressors
Provided new insights into mammalian evolution
Abstract
We present a new method for clustering based on compression. The method doesn't use subject-specific features or background knowledge, and works as follows: First, we determine a universal similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is universal in that it is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal but uses the non-computable notion of Kolmogorov complexity. We propose precise notions of similarity metric, normal compressor, and show that the NCD based on a normal compressor is a similarity metric that approximates universality. To extract a hierarchy…
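The first step described in the abstract, computing a distance from the compressed lengths of two files "singly and in pairwise concatenation", can be illustrated with a short sketch. It uses the standard NCD formula, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. The choice of zlib as the compressor and the function names `compressed_len`, `ncd`, and `ncd_matrix` are assumptions made for illustration, not details taken from this page; the second step (hierarchical clustering of the resulting distance matrix) is not shown.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of `data` after compression (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their structure;
    values near 1 suggest they share little.
    """
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # compressed length of the pairwise concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise NCD matrix, the input a clustering step would consume."""
    n = len(items)
    return [[ncd(items[i], items[j]) for j in range(n)] for i in range(n)]
```

With a real-world compressor the values are only approximate: the distance of an object to itself is typically small but not exactly zero, and the result depends on the compressor's window size and overhead.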
Taxonomy
Topics: Algorithms and Data Compression · Semigroups and Automata Theory · Computability, Logic, AI Algorithms
