Normalized Information Distance
Paul M.B. Vitanyi (CWI, Univ. Amsterdam), Frank J. Balbach (Univ. Waterloo), Rudi L. Cilibrasi (CWI), and Ming Li (Univ. Waterloo)

TL;DR
This paper discusses the normalized information distance, a universal, theoretically grounded measure based on Kolmogorov complexity, and explores practical approximations for various data types to enable feature-free clustering and data mining.
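For reference, the normalized information distance between two strings x and y is usually stated as below, where K(x) is the Kolmogorov complexity of x and K(x | y) is the conditional complexity of x given y. The summary does not spell out the formula, so the notation here follows the standard definition from the literature and is an assumption about what the chapter itself uses.

```latex
\mathrm{NID}(x, y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}}
```

Because K is uncomputable, this quantity can only be approximated in practice.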
Contribution
It introduces practical methods to approximate the normalized information distance using compression and web statistics, enabling broad applications in machine learning and data analysis.
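A minimal sketch of the compression-based approximation, assuming the commonly cited normalized compression distance NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of zlib as the compressor and the helper names `compressed_size` and `ncd` are illustrative assumptions, not taken from the paper.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Assumption: zlib plays the role of the compressor; any real-world
    compressor only approximates Kolmogorov complexity.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = a.replace(b"lazy", b"sleepy")   # a close variant of a
    c = bytes(range(256)) * 4           # unrelated content
    print("ncd(a, b) =", ncd(a, b))
    print("ncd(a, c) =", ncd(a, c))
```

Related inputs should score noticeably lower than unrelated ones. The absolute values depend on the compressor and on input length, so they are best compared only within one setup.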
Findings
Effective clustering in bioinformatics and music analysis
Successful application in machine translation tasks
Demonstrated feasibility of feature-free, parameter-free data mining
Abstract
The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine…
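The page-count realization can be sketched in the same spirit. The code below assumes the form usually given for the normalized Google distance: NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(x) is the number of pages containing term x, f(x, y) the number containing both terms, and N the total number of indexed pages. The function name `ngd` and all counts in the usage example are hypothetical; no real search API is queried.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized distance computed from page counts.

    fx, fy : pages containing each term alone (hypothetical inputs)
    fxy    : pages containing both terms
    n      : total number of indexed pages (a normalizing constant)
    """
    if min(fx, fy, fxy) <= 0:
        raise ValueError("all page counts must be positive")
    if n <= max(fx, fy):
        raise ValueError("n must exceed the individual page counts")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    total_pages = 10**9
    print("often co-occurring:", ngd(1000, 500, 400, total_pages))
    print("rarely co-occurring:", ngd(1000, 500, 10, total_pages))
```

Terms that appear together on most of their pages get a distance near 0, and terms that rarely co-occur get a larger one.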
Taxonomy
Topics: Algorithms and Data Compression · Computability, Logic, AI Algorithms · Semigroups and Automata Theory
