Language Modeling Is Compression

Grégoire Delétang; Anian Ruoss; Paul-Ambroise Duquenne; Elliot Catt; Tim Genewein; Christopher Mattern; Jordi Grau-Moya; Li Kevin Wenliang; Matthew Aitchison; Laurent Orseau; Marcus Hutter; Joel Veness

arXiv:2309.10668·cs.LG·March 20, 2024·27 cites

TL;DR

This paper explores the idea that large language models can be viewed as powerful compressors, providing new insights into their capabilities and enabling the use of general-purpose compressors for generative modeling.

Contribution

It demonstrates that large language models are effective predictors and compressors, offering a novel perspective that links compression with prediction and enables new applications.

Findings

01

Chinchilla 70B compresses ImageNet and LibriSpeech better than domain-specific compressors.

02

Large language models exhibit strong general-purpose prediction and compression abilities.

03

The prediction-compression equivalence allows using standard compressors for generative modeling.
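The third finding can be illustrated with a minimal sketch, assuming a simple scheme that is not necessarily the authors' procedure: a standard compressor (here `zlib`) scores each candidate next byte by how short the compressed context becomes when that byte is appended, and the length differences are turned into probabilities. The function name `next_byte_distribution` and the `2^(-bits)` weighting are illustrative assumptions.

```python
import zlib
from typing import Dict


def next_byte_distribution(context: bytes, candidates: bytes) -> Dict[int, float]:
    """Turn a compressor into a (very coarse) next-byte predictor.

    Assumption: a candidate that yields a shorter compressed output is
    treated as more probable, with weight 2^(-extra bits).
    """
    # Compressed length (in bytes) of the context extended by each candidate.
    lengths = {c: len(zlib.compress(context + bytes([c]), 9)) for c in candidates}
    shortest = min(lengths.values())
    # Convert extra bytes into extra bits, then into unnormalized weights.
    weights = {c: 2.0 ** (-8 * (n - shortest)) for c, n in lengths.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


dist = next_byte_distribution(b"abababababababab", b"abc")
for byte_value, prob in sorted(dist.items()):
    print(chr(byte_value), round(prob, 4))
```

Because `zlib` reports lengths in whole bytes, many candidates tie and the resulting distribution is crude; the sketch only shows the direction of the conversion, from code length to probability.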

Abstract

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw…
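The direction from prediction to compression can be sketched as follows. An arithmetic coder driven by a predictive model spends roughly -log2 p(x_t | x_<t) bits per symbol, so the model's total log-loss is, up to a small constant, the compressed size. The example below is a minimal illustration under stated assumptions: it replaces the language model with a toy adaptive byte-frequency model (add-one smoothing), and the function names are hypothetical. It computes the ideal code length only and does not implement an actual arithmetic coder.

```python
import math
from collections import Counter


def ideal_code_length_bits(data: bytes, alphabet_size: int = 256) -> float:
    """Sum of -log2 p(x_t | x_<t) under a toy adaptive order-0 model."""
    counts = Counter()
    total_bits = 0.0
    for t, symbol in enumerate(data):
        # Add-one smoothed probability from the bytes seen so far.
        p = (counts[symbol] + 1) / (t + alphabet_size)
        total_bits += -math.log2(p)
        counts[symbol] += 1
    return total_bits


def compression_rate(data: bytes) -> float:
    """Ideal compressed size divided by raw size (lower is better)."""
    if not data:
        raise ValueError("data must be non-empty")
    return ideal_code_length_bits(data) / (8 * len(data))


repetitive = b"a" * 1000
varied = bytes(range(256)) * 4
print(compression_rate(repetitive) < compression_rate(varied))
```

A better predictor assigns higher probability to the symbols that actually occur, which lowers the sum and therefore the compression rate; this is the sense in which the abstract reports compressed size as a percentage of raw size.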

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

- The paper presents how large language models pre-trained on text data can be used for compression beyond text data
- The authors demonstrate that this approach outperforms several well-established compression methods like gzip, PNG or FLAC in terms of raw compression ratio
- The paper provides insights on how different aspects like model size and choice of the tokenizer affect performance. For example, for model size the authors provide empirical scaling laws
- The experiments are well descr…

Weaknesses

- The motivation of this work is rather unclear to me. Is this work about advocating the use of pre-trained large language models as a potential method for compression? If so, how can they be used as such in practice considering their limitations? Or is it about using the compression framework to better understand large language models? If so, why is it interesting to study pre-trained large language models "through the lens of compression"?
- The authors mention that they “advocate for using (…

Reviewer 02 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

1. Novel in the sense of applying LLM to compressed coding of images & audio.
2. Demonstration through resourceful examples.

Weaknesses

1. The idea of deep model learning being a compression of natural data is not new; I think this is echoed by the authors too. It has, e.g., been a core and explicit theme in "High Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications". As such, shouldn't the paper's title be more specific, such as "LLMs are general-purpose image & audio compressors"?
2. A key into understanding the algorithm is Fig. 1, but the figure contains ambiguities and confusion…

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

1. The paper is well written and clearly investigates how and why compression and prediction are equivalent.
2. It evaluates large pretrained models as compressors against various standard compressors and shows that they are competitive, not only on text but also on modalities they were never trained on, such as images and audio data.

Weaknesses

Since the paper discusses how the number of parameters in larger language models relates to compression performance, it would be better to investigate the reasons behind this relationship.

Code & Models

Repositories

google-deepmind/language_modeling_is_compression
JAX · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Chinchilla