Clustering by compression

Rudi Cilibrasi (CWI); Paul Vitanyi (CWI; University of Amsterdam)

arXiv:cs/0312044·cs.CV·May 23, 2007·62 cites

Open Access

TL;DR

This paper introduces a universal, compression-based clustering method that works across diverse domains without domain-specific features, using normalized compression distance and hierarchical clustering.

Contribution

It proposes a new universal similarity measure based on compression, along with a hierarchical clustering approach and a quartet method for extracting clusters, applicable across many fields.

Findings

01. Successfully applied to genomics, linguistics, music, and more

02. Demonstrated robustness across different compressors

03. Provided new insights into mammalian evolution

Abstract

We present a new method for clustering based on compression. The method doesn't use subject-specific features or background knowledge, and works as follows: First, we determine a universal similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is universal in that it is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal but uses the non-computable notion of Kolmogorov complexity. We propose precise notions of similarity metric, normal compressor, and show that the NCD based on a normal compressor is a similarity metric that approximates universality. To extract a hierarchy…
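The first step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the standard NCD formula, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length, and it uses zlib as a stand-in compressor. The helper names (`compressed_len`, `ncd`, `distance_matrix`) and the sample data are hypothetical.

```python
import random
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, assuming the standard formula."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # pairwise concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise NCD values; the diagonal is set to 0.0 by convention."""
    n = len(items)
    return [
        [0.0 if i == j else ncd(items[i], items[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical sample data: two near-identical texts and one noise block.
text_a = b"the quick brown fox jumps over the lazy dog " * 20
text_b = b"the quick brown fox leaps over the lazy dog " * 20
rng = random.Random(0)  # fixed seed so the example is repeatable
noise = bytes(rng.randrange(256) for _ in range(880))

matrix = distance_matrix([text_a, text_b, noise])
for row in matrix:
    print(["%.3f" % d for d in row])
```

The resulting matrix would then be handed to a hierarchical clustering step. The two similar texts should score noticeably closer to each other than either does to the noise block, though exact values depend on the compressor and its settings.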


Taxonomy

Topics: Algorithms and Data Compression · Semigroups and Automata Theory · Computability, Logic, AI Algorithms