Evaluating the Impact of Information Distortion on Normalized Compression Distance
Ana Granados, Manuel Cebrian, David Camacho, Francisco de B. Rodriguez

TL;DR
This study investigates how different information distortions affect the Kolmogorov complexity and clustering accuracy of classical English books under Normalized Compression Distance (NCD), and finds that distorting the most frequent words preserves the clustering best.
Contribution
It introduces word-level distortion techniques (most frequent, infrequent, and random words) and evaluates their impact on complexity and clustering error, providing insight into how much information survives in compression-based analysis.
Findings
Modifying frequent words best preserves clustering accuracy.
Different distortion methods have varying impacts on complexity.
Empirical results explain the effects of distortions on information content.
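As an illustration of what a frequent-word distortion could look like, here is a minimal Python sketch. The summary does not say how words are modified, so the length-preserving replacement with a filler character, the function name `distort_frequent_words`, and the parameters `top_k` and `filler` are all assumptions made for this example, not the authors' procedure.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z]+")

def distort_frequent_words(text: str, top_k: int, filler: str = "*") -> str:
    """Mask every occurrence of the top_k most frequent words in text.

    Assumption: each masked word is replaced by filler characters of the
    same length, so the text layout is preserved. The source does not
    specify the replacement rule.
    """
    # Count words case-insensitively.
    counts = Counter(w.lower() for w in WORD.findall(text))
    frequent = {w for w, _ in counts.most_common(top_k)}

    def mask(match: re.Match) -> str:
        word = match.group(0)
        return filler * len(word) if word.lower() in frequent else word

    return WORD.sub(mask, text)

sample = "the cat and the dog and the bird"
distorted = distort_frequent_words(sample, top_k=1)
print(distorted)  # *** cat and *** dog and *** bird
```

Swapping the `most_common` selection for the least frequent words, or for a random sample of the vocabulary, would give rough analogues of the other two distortion types that the abstract compares against.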
Abstract
In this paper we apply different techniques of information distortion to a set of classical books written in English. We study the impact that these distortions have upon the Kolmogorov complexity and the clustering by compression technique (the latter based on Normalized Compression Distance, NCD). We show how to decrease the complexity of the considered books by introducing several modifications in them. We measure how well the information contained in each book is maintained using a clustering error measure. We find experimentally that the best way to keep the clustering error from increasing is by means of modifications in the most frequent words. We explain the details of these information distortions and we compare them with other kinds of modifications, such as random word distortions and infrequent word distortions. Finally, some phenomenological explanations from the different empirical results that have been…
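For readers unfamiliar with NCD, the standard definition is NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed size of s and xy is the concatenation of x and y. The sketch below computes it in Python. The choice of `zlib` as the compressor is an assumption made for illustration; the text above does not state which compressor the study used, and the helper names are hypothetical.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of data after zlib compression (assumed compressor)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    Values near 0 suggest similar inputs; values near 1 suggest
    the inputs share little compressible structure.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Two toy inputs: a repetitive English sentence and a run of digits.
text_a = b"the quick brown fox jumps over the lazy dog " * 50
text_b = b"".join(str(i * i).encode() for i in range(2000))

print(ncd(text_a, text_a))  # small: a text is close to itself
print(ncd(text_a, text_b))  # larger: the inputs share little structure
```

A clustering-by-compression pipeline would compute this distance for every pair of books and feed the resulting matrix to a hierarchical clustering method; the compressed size also serves as a practical upper-bound estimate of Kolmogorov complexity, which itself is not computable.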
Taxonomy
Topics: Computability, Logic, AI Algorithms · Algorithms and Data Compression · Fractal and DNA sequence analysis
