Normalized Information Distance

Paul M.B. Vitanyi (CWI, Univ. Amsterdam), Frank J. Balbach (Univ. Waterloo), Rudi L. Cilibrasi (CWI), and Ming Li (Univ. Waterloo)

arXiv:0809.2553 · cs.IR · September 16, 2008 · 5 cites

TL;DR

This paper discusses the normalized information distance, a universal, theoretically grounded measure based on Kolmogorov complexity, and explores practical approximations for various data types to enable feature-free clustering and data mining.
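For reference, the normalized information distance is usually stated in the literature in terms of the Kolmogorov complexity K(x) of an object and the conditional complexity K(x|y) of x given y. The formula below is that standard statement, not text quoted from this page:

```latex
\mathrm{NID}(x, y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}}
```

Because K is uncomputable, this quantity can only be approximated in practice.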

Contribution

It introduces practical methods to approximate the normalized information distance using compression and web statistics, enabling broad applications in machine learning and data analysis.
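The web-statistics approximation is commonly written as the normalized Google distance (NGD), computed from page counts. A minimal sketch, assuming the counts have already been obtained from some search index; the function name `ngd`, its parameters, and the numbers in the usage lines are hypothetical and chosen only for illustration:

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google distance from page counts.

    fx, fy : number of pages containing term x, term y (assumed > 0)
    fxy    : number of pages containing both terms
    n      : total number of indexed pages (a normalizing constant)
    """
    if fxy == 0:
        # The terms never co-occur, so the distance is unbounded.
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Hypothetical counts, not real search results.
always_together = ngd(1000, 1000, 1000, 10**6)
rarely_together = ngd(1000, 1000, 10, 10**6)
```

Terms that always appear together get distance 0, and the distance grows as co-occurrence becomes rarer relative to the individual counts.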

Findings

01 Effective clustering in bioinformatics and music analysis

02 Successful application in machine translation tasks

03 Demonstrates feasibility of parameter-free data mining

Abstract

The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine…
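The compression-based realization is commonly called the normalized compression distance (NCD): the compressed length C(x) stands in for Kolmogorov complexity. A minimal sketch, assuming zlib as the compressor; the choice of compressor, the helper names, and the sample inputs are illustrative assumptions, not taken from the chapter:

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for K)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs: two near-identical texts and one block of noise.
text_a = b"the quick brown fox jumps over the lazy dog " * 20
text_b = text_a.replace(b"lazy", b"busy", 1)
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(900))

d_similar = ncd(text_a, text_b)
d_different = ncd(text_a, noise)
```

Similar inputs compress well together, so `d_similar` comes out well below `d_different`. Real compressors only approximate the ideal measure, so values can fall slightly outside the range [0, 1].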

Code & Models

Repositories

W95Psp/NID-results

Taxonomy

Topics: Algorithms and Data Compression · Computability, Logic, AI Algorithms · Semigroups and automata theory