ESPACE: Dimensionality Reduction of Activations for Model Compression
Charbel Sakr, Brucek Khailany

TL;DR
ESPACE compresses large language models by projecting activations onto a pre-calibrated set of principal components, reaching 50% compression with small accuracy degradation and lower inference latency.
Contribution
ESPACE is the first to leverage activation projection onto principal components for LLM compression, enabling retraining without loss of expressivity and efficient inference.
Findings
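Achieves 50% compression of GPT3, Llama2, and Nemotron4 models with small accuracy degradation, as low as a 0.18 perplexity increase on GPT3-22B.
At 20% to 40% compression, GPT3 models outperform their baselines, by up to a 0.38 decrease in perplexity for GPT3-8B.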
Reduces inference latency and GEMM execution time.
Abstract
We propose ESPACE, an LLM compression technique based on dimensionality reduction of activations. Unlike prior works on weight-centric tensor decomposition, ESPACE projects activations onto a pre-calibrated set of principal components. Because the approach is activation-centric, LLMs can be retrained with no loss of expressivity, while at inference a weight decomposition is obtained as a byproduct of matrix multiplication associativity. Theoretical results on the construction of projection matrices with optimal computational accuracy are provided. Experimentally, we find ESPACE enables 50% compression of GPT3, Llama2, and Nemotron4 models with small accuracy degradation, as low as a 0.18 perplexity increase on GPT3-22B. At lower compression rates of 20% to 40%, ESPACE lets GPT3 models outperform their baselines, by up to a 0.38 decrease in perplexity for GPT3-8B. ESPACE also…
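The associativity argument in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `calibrate_projection`, the row-vector convention `y = x @ W`, the synthetic data, and the choice of the top eigenvectors of the sample autocorrelation matrix as the projection are all assumptions made for illustration (the paper derives its own projection constructions, which are not reproduced here).

```python
import numpy as np

def calibrate_projection(X_calib, rank):
    """Hypothetical helper: return P of shape (K, rank) with orthonormal columns,
    taken as the top eigenvectors of the sample autocorrelation of activations."""
    C = X_calib.T @ X_calib / X_calib.shape[0]   # (K, K), symmetric
    _, eigvecs = np.linalg.eigh(C)               # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :rank]            # keep the `rank` leading directions

rng = np.random.default_rng(0)
K, N, L = 64, 48, 16   # input dim, output dim, reduced dim (illustrative sizes)

# Synthetic activations lying close to an L-dimensional subspace (assumption).
basis = rng.standard_normal((L, K))
X = rng.standard_normal((1000, L)) @ basis + 0.01 * rng.standard_normal((1000, K))
W = rng.standard_normal((K, N))                  # full weight matrix

P = calibrate_projection(X, L)

y_ref = X @ W                                    # uncompressed layer output

# Training-time form: activations are projected, W stays full size.
y_train = ((X @ P) @ P.T) @ W

# Inference-time form: by associativity, P.T @ W is precomputed once,
# so only P (K x L) and W_small (L x N) need to be stored.
W_small = P.T @ W
y_infer = (X @ P) @ W_small

rel_err = np.linalg.norm(y_ref - y_infer) / np.linalg.norm(y_ref)
params_full = K * N
params_compressed = K * L + L * N
```

With these illustrative sizes the layer stores 1792 values instead of 3072. The two forms compute the same product, which is why the full weight matrix can be kept during retraining and only the smaller factors deployed.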
Taxonomy
Topics: Real-time simulation and control systems · Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques
Methods: Sparse Evolutionary Training
