Rate-Distortion Optimization for Transformer Inference
Anderson de Andrade, Alon Harell, Ivan V. Bajić

TL;DR
This paper introduces a rate-distortion framework for lossy compression of transformer intermediate representations, enabling more efficient inference by balancing bitrate and accuracy.
Contribution
It presents a novel information-theoretic approach to compress transformer representations, providing bounds and insights into their rate-distortion behavior.
Findings
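The simplest of the proposed codecs achieves substantial rate savings on language benchmarks, outperforming more complex methods.
The rate-distortion behavior of transformers can be characterized, and bounds can be derived on the achievable rate of learnable codecs.
The rate-distortion formulation offers a unified lens for understanding performance in representation coding.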
Abstract
Transformers achieve superior performance on many tasks, but impose heavy compute and memory requirements during inference. This inference can be made more efficient by partitioning the process across multiple devices, which, in turn, requires compressing its intermediate representations. We introduce a principled rate-distortion-based framework for lossy compression that learns compact encodings that explicitly trade bitrate for accuracy. Experiments on language benchmarks show that the simplest of the proposed codecs achieves substantial rate savings, outperforming more complex methods. We characterize and analyze the rate-distortion behaviour of transformers, offering a unified lens for understanding performance in representation coding. This formulation extends information-theoretic concepts to derive bounds on the achievable rate of learnable codecs. For different architectures and…
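To make the bitrate-versus-accuracy trade-off concrete, here is a minimal sketch of a generic rate-distortion cost, J = D + λR, evaluated on a toy vector. It is not the paper's codec. It assumes uniform scalar quantization, uses empirical entropy as a stand-in for bitrate, and uses mean squared error as a stand-in for task distortion. The names `quantize`, `empirical_rate`, `mse`, `rd_cost`, and the value of `lam` are all hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def quantize(values, step):
    """Uniform scalar quantization: map each value to an integer bin index."""
    return [round(v / step) for v in values]


def empirical_rate(symbols):
    """Empirical entropy of the symbols, in bits per element (a proxy for bitrate)."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mse(values, symbols, step):
    """Mean squared error between the values and their dequantized reconstruction."""
    return sum((v - s * step) ** 2 for v, s in zip(values, symbols)) / len(values)


def rd_cost(values, step, lam):
    """Return (rate, distortion, Lagrangian cost D + lam * R) for one step size."""
    symbols = quantize(values, step)
    rate = empirical_rate(symbols)
    dist = mse(values, symbols, step)
    return rate, dist, dist + lam * rate


# Toy stand-in for an intermediate representation: i.i.d. Gaussian samples.
random.seed(0)
hidden = [random.gauss(0.0, 1.0) for _ in range(4096)]

# Sweep the quantization step; lam is a hypothetical trade-off weight.
for step in (0.25, 0.5, 1.0, 2.0):
    rate, dist, cost = rd_cost(hidden, step, lam=0.05)
    print(f"step={step:<4} rate={rate:.3f} bits  mse={dist:.4f}  cost={cost:.4f}")
```

A coarser step lowers the rate and raises the distortion; the weight λ decides which point on that curve minimizes the cost. A learned codec would optimize such an objective end to end rather than sweep a fixed quantizer.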
