Text Ranking and Classification using Data Compression
Nitya Kasturi, Igor L. Markov

TL;DR
This paper introduces Zest, a language-agnostic text classification and ranking method based on data compression, which simplifies configuration and competes with existing embedding techniques in practical applications.
Contribution
The paper presents Zest, a compression-based approach to text categorization built on the Zstandard compressor. It strengthens earlier compression-based methods in several ways, and ablation studies confirm the value of each enhancement.
Findings
Zest simplifies text classification without extensive feature engineering.
Zest can compete with language-specific embeddings in real-world scenarios.
Zest does not outperform other counting methods on public datasets.
Abstract
A well-known but rarely used approach to text categorization uses conditional entropy estimates computed using data compression tools. Text affinity scores derived from compressed sizes can be used for classification and ranking tasks, but their success depends on the compression tools used. We use the Zstandard compressor and strengthen these ideas in several ways, calling the resulting language-agnostic technique Zest. In applications, this approach simplifies configuration, avoiding careful feature extraction and large ML models. Our ablation studies confirm the value of individual enhancements we introduce. We show that Zest complements and can compete with language-specific multidimensional content embeddings in production, but cannot outperform other counting methods on public datasets.
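The core idea in the abstract, scoring text affinity by compressed size, can be illustrated with a minimal sketch. This is not the Zest implementation: it uses Python's standard-library zlib as a stand-in for Zstandard, omits the enhancements the paper introduces, and the function names (compressed_size, affinity, classify, rank) are hypothetical. The affinity score approximates conditional entropy as the number of extra compressed bytes needed to encode a sample once a class corpus has already been seen.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (stand-in for Zstandard)."""
    return len(zlib.compress(data, 9))


def affinity(sample: str, corpus: str) -> int:
    """Extra compressed bytes needed for `sample` given `corpus`.

    Lower values mean the sample shares more structure with the corpus.
    """
    base = corpus.encode("utf-8")
    joined = base + b" " + sample.encode("utf-8")
    return compressed_size(joined) - compressed_size(base)


def rank(sample: str, corpora: dict[str, str]) -> list[str]:
    """Class labels ordered from closest to farthest from the sample."""
    return sorted(corpora, key=lambda label: affinity(sample, corpora[label]))


def classify(sample: str, corpora: dict[str, str]) -> str:
    """Label of the class corpus with the lowest affinity score."""
    return rank(sample, corpora)[0]


if __name__ == "__main__":
    # Toy corpora, invented for illustration only.
    corpora = {
        "sports": "the team won the match after scoring a late goal "
                  "in the final minutes of the game",
        "cooking": "simmer the sauce and stir in the garlic butter "
                   "and fresh basil before serving the pasta",
    }
    print(rank("scoring a late goal in the final minutes", corpora))
```

The same score serves both tasks: taking the minimum gives classification, and sorting gives ranking. On realistic data, the class corpora would be far larger than these toy strings, and the choice of compressor matters, as the abstract notes.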
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques
