Discovering Dataset Nature through Algorithmic Clustering based on String Compression

Ana Granados; Kostadin Koroutchev; Francisco de Borja Rodríguez

arXiv:2502.00208·cs.IT·February 4, 2025

TL;DR

This paper investigates how different text dataset representations affect clustering performance by applying progressive text distortion and string compression, revealing the importance of preserving text structure for certain datasets.

Contribution

It introduces a novel approach combining text distortion and algorithmic clustering to analyze dataset nature based on structural preservation effects.

Findings

01. For datasets whose information depends on text structure, clustering quality deteriorates as text distortion increases.

02. Adjusting the compressor's context size helps identify the nature of a dataset.

03. The results agree with those obtained using multidimensional projection methods.

Abstract

Text datasets can be represented using models that do not preserve text structure, or using models that preserve it. Our hypothesis is that, depending on the nature of the dataset, there can be advantages to using a model that preserves text structure over one that does not, and vice versa. The key is to determine the best way of representing a particular dataset, based on the dataset itself. In this work, we propose to investigate this problem by combining text distortion and algorithmic clustering based on string compression. Specifically, a distortion technique previously developed by the authors is applied to destroy text structure progressively. Following this, a clustering algorithm based on string compression is used to analyze the effects of the distortion on the information contained in the texts. Several experiments are carried out on text datasets and artificially-generated…
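
To make the idea of compression-based comparison concrete, here is a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: it uses the normalized compression distance (NCD), a common measure in compression-based clustering, and it uses `zlib` as a stand-in compressor. The `distort` function and its rule (masking listed words with asterisks of the same length) are purely hypothetical illustrations of "progressive distortion", not the authors' technique.

```python
import re
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the input after zlib compression (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Lower values mean more shared structure.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distort(text: str, words_to_mask: set) -> str:
    """Hypothetical distortion: mask listed words with same-length asterisks.

    Text length is preserved; only the chosen words lose their content.
    """
    def mask(match):
        word = match.group(0)
        return "*" * len(word) if word.lower() in words_to_mask else word

    return re.sub(r"[A-Za-z]+", mask, text)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog " * 20
    doc_b = "lorem ipsum dolor sit amet consectetur adipiscing elit " * 20
    print("NCD(a, a):", ncd(doc_a, doc_a))
    print("NCD(a, b):", ncd(doc_a, doc_b))
    print(distort("The cat sat on the mat", {"the", "on"}))
    # prints "*** cat sat ** *** mat"
```

In a clustering experiment along these lines, one would compute the pairwise NCD matrix for the original texts and again after each distortion step, then compare how the resulting clusters change. Growing the set of masked words between steps is one possible way to emulate progressive distortion.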
