Discovering Dataset Nature through Algorithmic Clustering based on String Compression
Ana Granados, Kostadin Koroutchev, Francisco de Borja Rodríguez

TL;DR
This paper investigates how different text dataset representations affect clustering performance by applying progressive text distortion and string compression, revealing the importance of preserving text structure for certain datasets.
Contribution
It introduces a novel approach combining text distortion and algorithmic clustering to analyze dataset nature based on structural preservation effects.
Findings
For datasets in which text structure carries information, clustering quality deteriorates as text distortion increases.
Varying the compressor's context size helps identify the nature of a dataset.
The results agree with those obtained using multidimensional projection methods.
Abstract
Text datasets can be represented using models that do not preserve text structure, or using models that preserve text structure. Our hypothesis is that, depending on the nature of the dataset, there can be advantages to using a model that preserves text structure over one that does not, and vice versa. The key is to determine the best way of representing a particular dataset, based on the dataset itself. In this work, we propose to investigate this problem by combining text distortion and algorithmic clustering based on string compression. Specifically, a distortion technique previously developed by the authors is applied to destroy text structure progressively. Following this, a clustering algorithm based on string compression is used to analyze the effects of the distortion on the information contained in the texts. Several experiments are carried out on text datasets and artificially-generated…
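The two ingredients named in the abstract, text distortion and a compression-based distance, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the normalized compression distance (NCD) is a common choice for compression-based clustering but is not confirmed by the summary above, `zlib` stands in for whatever compressor the paper uses, and `distort` is a hypothetical helper that masks words with asterisks while keeping their lengths.

```python
import zlib


def compressed_size(data: bytes) -> int:
    # Size in bytes of the zlib-compressed input (assumption: zlib as the compressor).
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance: small when x and y share information,
    # close to 1 when compressing them together saves almost nothing.
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distort(text: str, keep: set) -> str:
    # Hypothetical distortion: every word not in `keep` is replaced by
    # asterisks of the same length, so word lengths and order survive.
    return " ".join(
        w if w.lower() in keep else "*" * len(w) for w in text.split()
    )


a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"lorem ipsum dolor sit amet consectetur adipiscing elit " * 20

print(ncd(a, a), ncd(a, b))  # the first value should be much smaller
print(distort("the cat sat", {"the"}))
```

A pairwise NCD matrix computed this way could be passed to any hierarchical clustering routine. Shrinking `keep` step by step would mimic progressive distortion, and comparing the resulting clusterings indicates how much a dataset depends on the masked words.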
