Gzip versus bag-of-words for text classification
Juri Opitz

TL;DR
This paper compares gzip compression and bag-of-words methods for text classification, demonstrating that bag-of-words can match or outperform gzip in accuracy and efficiency.
Contribution
It shows that bag-of-words approaches are competitive with gzip compression for text classification, offering similar or better results with greater efficiency.
Findings
Bag-of-words achieves comparable or better accuracy than gzip.
Bag-of-words is more computationally efficient.
Bag-of-words can be a preferable alternative to gzip for text classification.
Abstract
The effectiveness of compression in text classification ('gzip') has recently garnered lots of attention. In this note we show that 'bag-of-words' approaches can achieve similar or better results, and are more efficient.
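To make the comparison concrete, here is a minimal sketch of the two kinds of text distance being contrasted, each plugged into the same nearest-neighbour classifier. It is an illustration under assumptions, not the paper's implementation: the gzip side is assumed to use normalized compression distance, and the bag-of-words side is assumed, purely for illustration, to use a Jaccard distance over word sets. All function names and the toy data are hypothetical.

```python
import gzip
from collections import Counter


def gzip_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance (assumed form of the 'gzip' distance)."""
    ca, cb = gzip_len(a), gzip_len(b)
    cab = gzip_len(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def bow_distance(a: str, b: str) -> float:
    """Jaccard distance between sets of lowercased whitespace tokens.

    This is one simple bag-of-words distance chosen for illustration;
    the paper may define its distance differently.
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def knn_predict(query, train, distance, k=1):
    """Majority label among the k training texts closest to query."""
    nearest = sorted(train, key=lambda pair: distance(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical data, not from the paper).
TRAIN = [
    ("the team won the football match", "sports"),
    ("the bank raised interest rates", "finance"),
]

print(knn_predict("football match tonight", TRAIN, bow_distance))  # prints "sports"
print(knn_predict("football match tonight", TRAIN, ncd))
```

The efficiency gap is visible in the sketch itself: the compression distance runs gzip on every query and training pair, whereas the set-based distance needs only tokenization and set operations.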
Taxonomy
Topics: Algorithms and Data Compression · Machine Learning and Data Classification
