PRAC: Principal-Random Subspace for LLM Activation Compression and Memory-Efficient Training
Yanyi Li, Yimu Zhang, Cong Fang

TL;DR
PRAC is an activation compression method for large language model training that combines an SVD-based principal subspace with random sampling from its orthogonal complement, reducing training memory by up to 36% with negligible performance degradation.
Contribution
The paper proposes PRAC, a new spectral-based activation compression technique that achieves unbiased, low-variance estimates and substantial memory savings in LLM training.
Findings
Up to 36% memory reduction in training large models
Negligible performance degradation with PRAC
Minimal additional computational cost
Abstract
Activations have become the primary memory bottleneck in large-batch LLM training. However, existing compression methods fail to exploit the spectral structure of activations, resulting in slow convergence or limited compression. To address this, we relate the algorithm's fast convergence to the requirements on subspace projection, and show that an effective compression should yield an unbiased, low-variance estimate of the original activation. We propose Principal-Random Subspace for LLM Activation Compression (PRAC), which decomposes activations into two components: a principal subspace, captured via SVD to retain the dominant information, and a random subspace, sampled from the orthogonal complement to approximate the tail. By introducing a precise scaling factor, we prove that PRAC yields an unbiased gradient estimator with minimum variance under…
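The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `compress` and `reconstruct`, the use of right singular vectors, the Gaussian-then-QR sampling of the random subspace, and the scaling factor (d - k) / r are all assumptions chosen for illustration, since the abstract does not state PRAC's exact scaling factor. The assumed factor rests on a standard fact: a uniformly random r-dimensional subspace of the (d - k)-dimensional orthogonal complement has an expected projector equal to r / (d - k) times the complement's projector, so rescaling by (d - k) / r makes the reconstruction unbiased in expectation.

```python
import numpy as np

def compress(X, k, r, rng):
    """Project an n x d activation onto a principal and a random subspace.

    Illustrative sketch only; names and scaling are assumptions.
    """
    n, d = X.shape
    # Principal subspace: top-k right singular vectors of X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:k].T                              # shape (d, k)
    # Random subspace: Gaussian directions with the principal part removed,
    # then orthonormalised, so Q lies in the orthogonal complement of V.
    G = rng.standard_normal((d, r))
    G -= V @ (V.T @ G)
    Q, _ = np.linalg.qr(G)                    # shape (d, r)
    # Assumed scaling that makes scale * Q @ Q.T unbiased for the
    # projector onto the complement.
    scale = (d - k) / r
    return X @ V, X @ Q, V, Q, scale

def reconstruct(XV, XQ, V, Q, scale):
    """Rebuild an estimate of X from the two stored projections."""
    return XV @ V.T + scale * (XQ @ Q.T)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 8))
    XV, XQ, V, Q, scale = compress(X, k=2, r=3, rng=rng)
    X_hat = reconstruct(XV, XQ, V, Q, scale)
    print("stored values:", XV.size + XQ.size, "of", X.size)
    print("relative error of one sample:",
          np.linalg.norm(X_hat - X) / np.linalg.norm(X))
```

With r = d - k the random subspace spans the whole complement and the reconstruction is exact. A smaller r stores only the n x (k + r) projected activations (plus the two bases) in place of the n x d activation, at the cost of added variance in the estimate.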
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Speech Recognition and Synthesis
