Less is More: Parameter-Free Text Classification with Gzip
Zhiying Jiang, Matthew Y.R. Yang, Mikhail Tsirlin, Raphael Tang, Jimmy Lin

TL;DR
This paper introduces a simple, parameter-free text classification method combining gzip compression with k-nearest neighbors, achieving competitive results without training and outperforming BERT on out-of-distribution datasets.
Contribution
The paper presents a novel, lightweight, non-parametric text classification approach that bypasses training and fine-tuning, offering a universal alternative to deep neural networks.
Findings
Achieves accuracy competitive with non-pretrained deep learning methods on six in-distribution datasets, without any training.
Outperforms BERT on all five out-of-distribution datasets, including four low-resource languages.
Excels in few-shot learning scenarios with scarce labeled data.
Abstract
Deep neural networks (DNNs) are often used for text classification tasks as they usually achieve high levels of accuracy. However, DNNs can be computationally intensive with billions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that's easy, lightweight and universal in text classification: a combination of a simple compressor like gzip with a k-nearest-neighbor classifier. Without any training, pre-training or fine-tuning, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also performs particularly well in few-shot settings where…
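The compressor-plus-k-nearest-neighbor idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released code. It assumes the distance between two texts is the Normalized Compression Distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the gzip-compressed length, and that the two texts are joined with a single space before compression. The helper names clen, ncd and classify, and the default k = 3, are hypothetical choices made for this sketch.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two strings (assumed metric)."""
    cx, cy = clen(x), clen(y)
    # Assumption: the pair is concatenated with a single space.
    cxy = clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Majority vote over the k training texts closest to `query` under NCD.

    `train` is a list of (text, label) pairs. No model is fitted: the
    training texts are only compared against the query at prediction time.
    """
    nearest = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    # Ties are resolved in favor of the label encountered first among the
    # nearest neighbors (an arbitrary choice for this sketch).
    return votes.most_common(1)[0][0]
```

In this sketch, classifying one query compresses it against every training text, so prediction cost grows linearly with the size of the labeled set. That is the price paid for having no training step.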
Taxonomy
Topics: Machine Learning and Data Classification · Topic Modeling · Text and Document Classification Technologies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Attention Dropout · Residual Connection · Weight Decay · WordPiece · Dropout
