Text Ranking and Classification using Data Compression

Nitya Kasturi; Igor L. Markov

arXiv:2109.11577 · cs.LG · December 8, 2021

TL;DR

This paper introduces Zest, a language-agnostic text classification and ranking method based on data compression, which simplifies configuration and competes with existing embedding techniques in practical applications.

Contribution

The paper presents Zest, a compression-based approach to text categorization built on the Zstandard compressor. It strengthens earlier compression-based methods in several ways, and ablation studies confirm the value of each enhancement.

Findings

01 Zest simplifies configuration for text classification, avoiding careful feature extraction and large ML models.

02 Zest complements and can compete with language-specific multidimensional content embeddings in production.

03 Zest does not outperform other counting methods on public datasets.

Abstract

A well-known but rarely used approach to text categorization uses conditional entropy estimates computed using data compression tools. Text affinity scores derived from compressed sizes can be used for classification and ranking tasks, but their success depends on the compression tools used. We use the Zstandard compressor and strengthen these ideas in several ways, calling the resulting language-agnostic technique Zest. In applications, this approach simplifies configuration, avoiding careful feature extraction and large ML models. Our ablation studies confirm the value of individual enhancements we introduce. We show that Zest complements and can compete with language-specific multidimensional content embeddings in production, but cannot outperform other counting methods on public datasets.
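
The affinity idea in the abstract can be illustrated with a minimal sketch. It is not the Zest implementation: it assumes Python's standard-library zlib as a stand-in for Zstandard, omits every enhancement the paper introduces, and uses hypothetical names (compressed_size, extra_bytes, rank_classes, classify) and toy corpora invented for illustration. The score for a class is the number of extra compressed bytes needed to encode a text once the class corpus has already been seen, a rough proxy for conditional entropy; fewer extra bytes means higher affinity.

```python
import zlib
from typing import Dict, List, Tuple


def compressed_size(data: bytes) -> int:
    """Bytes zlib needs to encode `data` at its highest compression level."""
    return len(zlib.compress(data, 9))


def extra_bytes(corpus: str, text: str) -> int:
    """Extra compressed bytes needed for `text` after `corpus` is already encoded.

    Assumption: zlib replaces Zstandard here, so corpora must stay small
    (DEFLATE only looks back 32 KiB).
    """
    base = corpus.encode("utf-8")
    joined = base + b"\n" + text.encode("utf-8")
    return compressed_size(joined) - compressed_size(base)


def rank_classes(corpora: Dict[str, str], text: str) -> List[Tuple[str, int]]:
    """Return (label, extra_bytes) pairs, best-matching class first."""
    scores = [(label, extra_bytes(corpus, text)) for label, corpus in corpora.items()]
    return sorted(scores, key=lambda pair: pair[1])


def classify(corpora: Dict[str, str], text: str) -> str:
    """Pick the class whose corpus makes `text` cheapest to compress."""
    return rank_classes(corpora, text)[0][0]


# Toy corpora, invented for this example only.
CORPORA = {
    "sports": (
        "The goalkeeper saved a penalty in the second half. "
        "The striker scored a late goal and the team won the match. "
        "The coach praised the defence after the final whistle."
    ),
    "cooking": (
        "Whisk the eggs with sugar until the mixture turns pale. "
        "Simmer the sauce on low heat and season it with salt. "
        "Bake the bread in a hot oven until the crust is golden."
    ),
}

if __name__ == "__main__":
    query = "The striker scored a late goal and the team won the match."
    for label, cost in rank_classes(CORPORA, query):
        print(label, cost)
```

The same scores serve both tasks named in the abstract: taking the lowest-cost class gives a classification, while sorting by cost gives a ranking. On corpora this small the scores are noisy, so the sketch is only meant to show the mechanics.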

Code & Models

Repositories

facebookresearch/zest (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques