Activation Compression in LLMs: Theoretical Analysis and Efficient Algorithm
Wen-Da Wei, Han-Bin Fang, Yang-Di Liu, Jiang-Xin Shi, James Kwok, Yu-Feng Li

TL;DR
This paper presents a theoretical framework for activation compression in large language models, showing that it is safe for linear operators when the compression is unbiased but problematic for nonlinear ones, and introduces an activation-gradient co-compression method validated through experiments.
Contribution
The work develops a theory for activation compression in LLMs, proposes a novel co-compression algorithm, and validates it with experiments on multiple models and benchmarks.
Findings
Activation compression is safe for linear operators when the compression is unbiased, but problematic for nonlinear operators.
The proposed co-compression method reuses low-rank activation factors.
Experimental results show competitive accuracy and compression efficiency.
Abstract
Training large language models (LLMs) is highly memory-intensive, as training must store not only weights and optimizer states but also intermediate activations for backpropagation. While existing memory-efficient methods largely focus on gradients and optimizer states, activation compression is less well established due to the lack of LLM-tailored theory and guarantees. In this work, we develop a theoretical framework showing that activation compression is safe for linear operators when the compression is unbiased, but problematic for nonlinear ones. We further derive a gradient variance bound and establish convergence guarantees for applying activation compression to all linear operators under the standard L-smoothness assumption, showing that it does not change the convergence rate. Guided by the theory, we propose an activation-gradient co-compression method that reuses…
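The linear-versus-nonlinear claim in the abstract can be illustrated numerically. The sketch below is not the paper's algorithm: it assumes a hypothetical compressor (a Gaussian random projection scaled so that the reconstruction is unbiased) and compares two backward passes computed from the reconstructed activations. For a linear layer y = x @ w, the weight gradient x.T @ grad_y is linear in x, so averaging over many compressions should recover the exact gradient. For a tanh activation, the backward pass depends nonlinearly on x, so the averaged estimate stays biased. All names (compress, decompress, demo) and the sizes used are assumptions made for illustration.

```python
import numpy as np


def compress(x, rank, rng):
    # Hypothetical compressor: Gaussian random projection scaled so that
    # E[p @ p.T] = I, which makes the reconstruction unbiased.
    p = rng.standard_normal((x.shape[1], rank)) / np.sqrt(rank)
    return x @ p, p  # (n, rank) sketch of the activations, plus the projection


def decompress(sketch, p):
    # Reconstruction x_hat = x @ p @ p.T, with E[x_hat] = x.
    return sketch @ p.T


def relative_error(estimate, exact):
    return np.linalg.norm(estimate - exact) / np.linalg.norm(exact)


def demo(trials=4000, seed=0):
    rng = np.random.default_rng(seed)
    n, d, m, rank = 8, 16, 4, 4  # illustrative sizes only
    x = rng.standard_normal((n, d))           # saved activations
    grad_y_lin = rng.standard_normal((n, m))  # upstream grad of a linear layer
    grad_y_act = rng.standard_normal((n, d))  # upstream grad of a tanh layer

    # Exact backward passes computed from the uncompressed activations.
    exact_lin = x.T @ grad_y_lin                      # dL/dW for y = x @ w
    exact_act = grad_y_act * (1.0 - np.tanh(x) ** 2)  # dL/dx for y = tanh(x)

    # Average the same backward passes over many independent compressions.
    mean_lin = np.zeros_like(exact_lin)
    mean_act = np.zeros_like(exact_act)
    for _ in range(trials):
        x_hat = decompress(*compress(x, rank, rng))
        mean_lin += x_hat.T @ grad_y_lin / trials
        mean_act += grad_y_act * (1.0 - np.tanh(x_hat) ** 2) / trials

    return relative_error(mean_lin, exact_lin), relative_error(mean_act, exact_act)


if __name__ == "__main__":
    lin_err, act_err = demo()
    print(f"linear layer, relative error of averaged gradient: {lin_err:.3f}")
    print(f"tanh layer,   relative error of averaged gradient: {act_err:.3f}")
```

In this setup the linear-layer error shrinks as the number of trials grows, while the tanh error plateaus at a nonzero value. A single compressed step is still noisy even in the linear case, which is the quantity a gradient variance bound would control.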
