Generalized Compression Dictionary Distance as Universal Similarity Measure
Andrey Bogomolov, Bruno Lepri, Fabio Pianesi

TL;DR
This paper introduces a new similarity measure based on compression dictionaries that outperforms previous methods in clustering and machine learning tasks, offering efficiency, scalability, and content independence.
Contribution
The paper proposes a novel similarity measure using compression dictionaries that is more efficient, scalable, and content-independent than existing methods like NCD, suitable for various machine learning applications.
Findings
Outperforms Normalized Compression Distance in clustering tasks.
Eliminates the need for actual compression, increasing efficiency.
Scales linearly with data size and is content independent.
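For context, the baseline named in the findings, Normalized Compression Distance, is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. Below is a minimal sketch of that baseline, assuming zlib as the compressor; the choice of compressor and the helper names are illustrative and not taken from the paper.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Every call compresses x, y, and their concatenation in full,
    which is the cost the dictionary-based measures aim to avoid.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values near 0 suggest the two inputs share most of their structure; values near 1 suggest they share little. Real compressors only approximate Kolmogorov complexity, so results can fall slightly outside the range [0, 1].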
Abstract
We present a new similarity measure based on information-theoretic measures which is superior to Normalized Compression Distance for clustering problems and inherits the useful properties of conditional Kolmogorov complexity. We show that Normalized Compression Dictionary Size and Normalized Compression Dictionary Entropy are computationally more efficient, as the need to perform the compression itself is eliminated. They also scale linearly with exponential vector size growth and are content independent. We show that normalized compression dictionary distance is compressor independent, if limited to lossless compressors, which leaves room for optimization and faster implementations for real-time and big data applications. The introduced measure is applicable for machine learning tasks of parameter-free unsupervised clustering, supervised learning such as classification and…
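To illustrate how a distance can be computed from a compression dictionary without producing any compressed output, here is a rough sketch. It is not the paper's definition of Normalized Compression Dictionary Size or Entropy: it assumes an LZ78-style phrase dictionary and an NCD-style normalization over dictionary sizes, and the function names `lz78_dictionary` and `dictionary_size_distance` are hypothetical.

```python
def lz78_dictionary(data: bytes) -> set:
    """Collect the phrase set of an LZ78-style parse (assumed scheme).

    Only the dictionary is built; no encoded output is ever emitted.
    """
    phrases = set()
    current = b""
    for byte in data:
        current += bytes([byte])
        if current not in phrases:
            # New phrase: record it and start the next one.
            phrases.add(current)
            current = b""
    return phrases


def dictionary_size_distance(x: bytes, y: bytes) -> float:
    """Illustrative distance from dictionary sizes (not the paper's formula).

    Uses (|D(x) U D(y)| - min(|D(x)|, |D(y)|)) / max(|D(x)|, |D(y)|).
    """
    dx, dy = lz78_dictionary(x), lz78_dictionary(y)
    largest = max(len(dx), len(dy))
    if largest == 0:
        return 0.0  # both inputs empty: treat as identical
    return (len(dx | dy) - min(len(dx), len(dy))) / largest
```

Under these assumptions, identical inputs give a distance of exactly 0 and inputs with disjoint phrase sets of equal size give exactly 1. Each dictionary is built in a single pass over its input.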
Taxonomy
Topics: Computability, Logic, AI Algorithms · Algorithms and Data Compression · Semigroups and Automata Theory
