Revisiting Data Compression with Language Modeling
Chen-Han Tsai

TL;DR
This paper explores the use of large language models for data compression, achieving state-of-the-art results on enwik9 and extending their application to various data types, highlighting both strengths and limitations.
Contribution
The paper reports a new state-of-the-art adjusted compression rate on enwik9 using LLMs without additional training, and evaluates LLM compressors on non-English text, code, and byte stream sequences.
Findings
Achieved an adjusted compression rate of around 18% on enwik9
LLMs excel at compressing text-dominant data
Competitive performance on non-natural text sequences
Abstract
In this report, we investigate the potential use of large language models (LLMs) in the task of data compression. Previous works have demonstrated promising results in applying LLMs to compressing not only text, but also a wide range of multi-modal data. Despite the favorable performance achieved, there still remain several practical questions that pose a challenge to replacing existing data compression algorithms with LLMs. In this work, we explore different methods to achieve a lower adjusted compression rate using LLMs as data compressors. In comparison to previous works, we were able to achieve a new state-of-the-art (SOTA) adjusted compression rate of around 18% on the enwik9 dataset without additional model training. Furthermore, we explore the use of LLMs in compressing non-English data, code data, and byte stream sequences. We show that while LLMs excel in…
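The abstract reports results as an adjusted compression rate without defining the term here. The sketch below assumes the definition commonly used in prior LLM-compression work, where the size of the model is added to the compressed output before dividing by the raw size; this page does not confirm that the paper uses exactly this formula. The function names and all byte counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def compression_rate(compressed_bytes: int, raw_bytes: int) -> float:
    """Plain compression rate: compressed size over raw size (lower is better)."""
    return compressed_bytes / raw_bytes


def adjusted_compression_rate(compressed_bytes: int, model_bytes: int, raw_bytes: int) -> float:
    """Assumed definition: the compressor's own size counts against it."""
    return (compressed_bytes + model_bytes) / raw_bytes


# Hypothetical toy sizes, not taken from the paper.
raw = 1_000         # bytes of original data
compressed = 150    # bytes after compression
model = 100         # bytes needed to store the model itself

plain = compression_rate(compressed, raw)
adjusted = adjusted_compression_rate(compressed, model, raw)
print(f"plain: {plain:.2f}, adjusted: {adjusted:.2f}")
```

Under this assumed definition, a large model can have an excellent plain rate and still a poor adjusted rate, since its parameters are charged to the output.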
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Advanced Data Compression Techniques
