Normalized Compression Distance of Multisets with Applications
Andrew R. Cohen (Electrical and Computer Engineering, Drexel University, Philadelphia), Paul M. B. Vitanyi (CWI, University of Amsterdam)

TL;DR
This paper introduces a new normalized compression distance for multisets that is a metric, enabling improved similarity measurement and classification across various data types, surpassing previous pairwise NCD methods.
Contribution
It develops the first metric NCD for multisets, extending the applicability of compression-based similarity measures to more complex data structures.
Findings
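Achieved significantly better results than pairwise NCD on retinal progenitor cell classification and on related synthetically generated data.
Improved handwritten digit recognition accuracy significantly over pairwise NCD.
Obtained similar improvements on questions about axonal organelle transport.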
Abstract
Normalized compression distance (NCD) is a parameter-free, feature-free, alignment-free, similarity measure between a pair of finite objects based on compression. However, it is not sufficient for all applications. We propose an NCD of finite multisets (a.k.a. multiples) of finite objects that is also a metric. Previously, attempts to obtain such an NCD failed. We cover the entire trajectory from theoretical underpinning to feasible practice. The new NCD for multisets is applied to retinal progenitor cell classification questions and to related synthetically generated data that were earlier treated with the pairwise NCD. With the new method we achieved significantly better results. Similarly for questions about axonal organelle transport. We also applied the new NCD to handwritten digit recognition and improved classification accuracy significantly over that of pairwise NCD by…
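For readers new to the pairwise NCD that the abstract takes as its starting point, it is commonly written as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length under a real-world compressor. The following is a minimal sketch of that pairwise form only, not of the paper's multiset construction. It assumes zlib as the compressor, and the function names `compressed_len` and `ncd` are illustrative and not taken from the paper.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Pairwise normalized compression distance between two byte strings."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # compress the concatenation of x and y
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs: a repetitive text and an unrelated byte pattern.
text = b"the quick brown fox jumps over the lazy dog " * 20
other = bytes(range(256)) * 4

same = ncd(text, text)    # an object compared with itself
diff = ncd(text, other)   # two unrelated objects
print(same, diff)
```

With a practical compressor the distance of an object to itself is small but usually not exactly zero, and unrelated objects score close to 1. The exact values depend on the compressor and on input length.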
