Language Modeling Is Compression
Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, Joel Veness

TL;DR
This paper explores the idea that large language models can be viewed as powerful compressors, providing new insights into their capabilities and enabling the use of general-purpose compressors for generative modeling.
Contribution
It demonstrates that large language models are effective predictors and compressors, offering a novel perspective that links compression with prediction and enables new applications.
Findings
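Chinchilla 70B, though trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, better than domain-specific compressors such as PNG and FLAC.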
Large language models exhibit strong general-purpose prediction and compression abilities.
The prediction-compression equivalence allows using standard compressors for generative modeling.
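The compressor-to-model direction in the last finding can be illustrated with a rough sketch. It assumes Python's `zlib` as the general-purpose compressor and scores each candidate next byte by how much it lengthens the compressed context. The function name `next_byte_distribution` and the byte-granular scoring rule are illustrative assumptions, not the authors' procedure.

```python
import zlib


def next_byte_distribution(context: bytes) -> list[float]:
    """Turn a compressor into a (crude) conditional distribution over the next byte.

    Assumption: a candidate byte that yields a shorter compressed output is
    treated as more probable, with weight 2 ** (-extra_bits).
    """
    # Compressed length, in bytes, of the context extended by each candidate.
    lengths = [len(zlib.compress(context + bytes([b]), 9)) for b in range(256)]
    shortest = min(lengths)
    # zlib output is byte-granular, so many candidates tie; this is a sketch only.
    weights = [2.0 ** (-8 * (n - shortest)) for n in lengths]
    total = sum(weights)
    return [w / total for w in weights]


dist = next_byte_distribution(b"abcabcabcabcabc")
```

Sampling from `dist`, appending the sampled byte to the context, and repeating gives a simple autoregressive generator. Because compressed lengths only change in whole bytes, the resulting distribution is coarse.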
Abstract
It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw…
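The predictor-to-compressor direction mentioned in the abstract can be sketched as follows. An arithmetic coder driven by a predictive model spends roughly -log2 p(x_i | x_<i) bits on each symbol, so the total code length is the model's cumulative log-loss, up to a couple of bits of overhead. The sketch below computes that ideal length using a simple adaptive byte-frequency model as a stand-in for a language model. The model choice and the function names are assumptions for illustration, and no actual bitstream is produced.

```python
import math
from collections import Counter


def ideal_code_length_bits(data: bytes) -> float:
    """Sum of -log2 p(byte | previous bytes) under an adaptive order-0 model."""
    counts: Counter = Counter()
    total_bits = 0.0
    for i, byte in enumerate(data):
        # Laplace-smoothed estimate from the i bytes seen so far (256 symbols).
        p = (counts[byte] + 1) / (i + 256)
        total_bits += -math.log2(p)
        counts[byte] += 1
    return total_bits


def compression_rate(data: bytes) -> float:
    """Ideal compressed size divided by raw size (lower is better)."""
    return ideal_code_length_bits(data) / (8 * len(data))


repetitive = compression_rate(b"a" * 1000)      # highly predictable input
incompressible = compression_rate(bytes(range(256)))  # every byte is new
```

A better predictor assigns higher probability to what actually comes next, which lowers the log-loss and therefore the compressed size. This is the sense in which the percentages quoted above are compression rates relative to raw size.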
Peer Reviews
Decision·ICLR 2024 poster
- The paper presents how large language models pre-trained on text data can be used for compression beyond text data
- The authors demonstrate that this approach outperforms several well-established compression methods like gzip, PNG or FLAC in terms of raw compression ratio
- The paper provides insights on how different aspects like model size and choice of the tokenizer affect performance. For example, for model size the authors provide empirical scaling laws
- The experiments are well descr
- The motivation of this work is rather unclear to me. Is this work about advocating the use of pre-trained large language models as a potential method for compression? If so, how can they be used as such in practice considering their limitations? Or is it about using the compression framework to better understand large language models? If so, why is it interesting to study pre-trained large language models "through the lens of compression"?
- The authors mention that they “advocate for using (
1. Novel in the sense of applying LLMs to compressed coding of images & audio.
2. Demonstration through resourceful examples.
1. The idea of deep model learning being a compression of natural data is not new; I think this is echoed by the authors too. It has, e.g., been a core and explicit theme in "High Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications". As such, shouldn't the paper's title be more specific, such as "LLMs are general-purpose image & audio compressors"?
2. A key into understanding the algorithm is Fig. 1, but the figure contains ambiguities and confusion
1. The paper is well written and clearly explains how and why compression and prediction are equivalent.
2. It evaluates large pretrained models used as compressors against various standard compressors and shows that they are competitive, not only on text but also on modalities they have never been trained on, such as images and audio data.
If we discuss the number of parameters in larger language models and how it reflects compression performance, it would be better to investigate the reasons behind this relationship.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Chinchilla
