Revisiting Data Compression with Language Modeling

Chen-Han Tsai

arXiv:2601.02875·cs.CL·January 7, 2026

TL;DR

This paper explores large language models as data compressors. It reports a state-of-the-art adjusted compression rate of around 18% on enwik9 without additional model training, and extends the evaluation to non-English data, code, and byte stream sequences, highlighting both strengths and limitations.

Contribution

The paper reports a new state-of-the-art adjusted compression rate on enwik9 using LLMs without additional training, and evaluates their compression performance across diverse data domains.

Findings

01 Achieved an adjusted compression rate of around 18% on enwik9

02 LLMs excel at compressing text-dominant data

03 Competitive performance on non-natural text sequences

Abstract

In this report, we investigate the potential use of large language models (LLMs) in the task of data compression. Previous works have demonstrated promising results in applying LLMs to compressing not only text, but also a wide range of multi-modal data. Despite the favorable performance achieved, there still remain several practical questions that pose a challenge to replacing existing data compression algorithms with LLMs. In this work, we explore different methods to achieve a lower adjusted compression rate using LLMs as data compressors. In comparison to previous works, we were able to achieve a new state-of-the-art (SOTA) adjusted compression rate of around 18% on the enwik9 dataset without additional model training. Furthermore, we explore the use of LLMs in compressing non-English data, code data, and byte stream sequences. We show that while LLMs excel in…
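
To make the quantities in the abstract concrete, here is a minimal Python sketch of how a predictive model turns into a compressor and how a compression rate can be computed from it. It is not the paper's method. A toy adaptive byte-frequency model stands in for an LLM, and the function names are hypothetical. The definition of the adjusted compression rate used here (compressed size plus model size, divided by raw size) is an assumption based on the convention in prior work on LLM-based compression; the abstract itself does not define the term.

```python
import math


def ideal_code_length_bits(data: bytes) -> float:
    """Ideal arithmetic-coding length of `data` under a toy adaptive model.

    The model is a Laplace-smoothed byte-frequency counter that stands in
    for an LLM (an assumption for illustration). A symbol predicted with
    probability p costs about -log2(p) bits under arithmetic coding.
    """
    counts = [1] * 256  # one pseudo-count per possible byte value
    total = 256
    bits = 0.0
    for b in data:
        p = counts[b] / total      # model's probability for the next byte
        bits += -math.log2(p)      # cost of coding that byte
        counts[b] += 1             # update the model after coding
        total += 1
    return bits


def compression_rate(compressed_bytes: float, raw_bytes: int) -> float:
    """Compressed size divided by raw size (lower is better)."""
    return compressed_bytes / raw_bytes


def adjusted_compression_rate(
    compressed_bytes: float, model_bytes: float, raw_bytes: int
) -> float:
    """Assumed definition: the model's own size counts as compressed output."""
    return (compressed_bytes + model_bytes) / raw_bytes


if __name__ == "__main__":
    sample = b"abracadabra " * 200
    compressed = ideal_code_length_bits(sample) / 8  # bits -> bytes
    print(compression_rate(compressed, len(sample)))
    # The toy model has no stored parameters, so model_bytes is 0 here.
    print(adjusted_compression_rate(compressed, 0, len(sample)))
```

Under this assumed definition, a large model has to amortize its parameter size over the data it compresses, which is why the adjusted rate is a stricter measure than the raw rate on a fixed-size dataset such as enwik9.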

Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Advanced Data Compression Techniques