Generalized Compression Dictionary Distance as Universal Similarity Measure

Andrey Bogomolov; Bruno Lepri; Fabio Pianesi

arXiv:1410.5792 · stat.ML · October 22, 2014 · 2 cites

Open Access

TL;DR

This paper introduces a similarity measure based on compression dictionaries that outperforms Normalized Compression Distance (NCD) on clustering problems. It avoids running the compression itself, scales linearly, and is content independent.

Contribution

The paper proposes a similarity measure built from compression dictionaries that is more efficient, more scalable, and more content independent than existing methods such as Normalized Compression Distance (NCD), and that applies to a range of machine learning tasks.

Findings

01. Outperforms Normalized Compression Distance in clustering tasks.

02. Eliminates the need for actual compression, increasing efficiency.

03. Scales linearly with data size and is content independent.
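
To make the second finding concrete, the sketch below builds only the phrase dictionary that an LZW-style parser would produce, without emitting any compressed output, and plugs the dictionary sizes into an NCD-shaped ratio. This is an illustration under stated assumptions, not the paper's definition: the LZW parsing rule, the function names `lzw_dictionary` and `dictionary_size_distance`, and the exact normalization are all hypothetical choices made here.

```python
from __future__ import annotations


def lzw_dictionary(data: bytes) -> set[bytes]:
    """Collect the phrases an LZW-style parser would add to its dictionary.

    Only the dictionary is built; no code stream (compressed output) is emitted.
    """
    phrases: set[bytes] = set()
    current = b""
    for value in data:
        candidate = current + bytes([value])
        # Single bytes are treated as implicitly known, as in classic LZW.
        if len(candidate) == 1 or candidate in phrases:
            current = candidate
        else:
            phrases.add(candidate)
            current = bytes([value])
    return phrases


def dictionary_size_distance(x: bytes, y: bytes) -> float:
    """NCD-shaped ratio over dictionary sizes (an assumed form, for illustration)."""
    dx = len(lzw_dictionary(x))
    dy = len(lzw_dictionary(y))
    dxy = len(lzw_dictionary(x + y))
    if max(dx, dy) == 0:
        return 0.0  # both inputs too short to produce any phrase
    return (dxy - min(dx, dy)) / max(dx, dy)
```

Because the parser makes a single pass over the input, its cost grows with the input length. Inputs that share structure add few new phrases when concatenated, so the ratio stays small for them.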

Abstract

We present a new similarity measure based on information theoretic measures which is superior to Normalized Compression Distance for clustering problems and inherits the useful properties of conditional Kolmogorov complexity. We show that Normalized Compression Dictionary Size and Normalized Compression Dictionary Entropy are computationally more efficient, as the need to perform the compression itself is eliminated. They also scale linearly with exponential vector size growth and are content independent. We show that normalized compression dictionary distance is compressor independent, if limited to lossless compressors, which leaves room for optimizations and implementation speed improvement for real-time and big data applications. The introduced measure is applicable for machine learning tasks of parameter-free unsupervised clustering, supervised learning such as classification and…
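
For reference, the baseline the abstract compares against can be sketched in a few lines. Normalized Compression Distance is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of `zlib` as the compressor and the helper name `compressed_size` are assumptions made here for illustration; the abstract does not say which compressor the authors used.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (the compressor is an assumed choice)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    # zlib output is never empty, so the denominator is always positive.
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Note that every call compresses three inputs in full, which is the cost the dictionary-based measures described in the abstract are meant to avoid.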

Taxonomy

Topics: Computability, Logic, AI Algorithms · Algorithms and Data Compression · Semigroups and Automata Theory