ESPACE: Dimensionality Reduction of Activations for Model Compression

Charbel Sakr; Brucek Khailany

arXiv:2410.05437·cs.LG·October 10, 2024

TL;DR

ESPACE compresses large language models by projecting activations onto a pre-calibrated set of principal components, reaching 50% compression with small accuracy degradation and lower inference latency.

Contribution

ESPACE is the first to leverage activation projection onto principal components for LLM compression. Because the projection acts on activations rather than weights, the model can be retrained with no loss of expressivity, and inference remains efficient.

Findings

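01 Achieves 50% compression of GPT3, Llama2, and Nemotron4 models with small accuracy degradation, as low as a 0.18 perplexity increase on GPT3-22B.

02 At 20% to 40% compression, GPT3 models outperform their baselines, by up to a 0.38 decrease in perplexity for GPT3-8B.

03 Reduces inference latency and GEMM execution time.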

Abstract

We propose ESPACE, an LLM compression technique based on dimensionality reduction of activations. Unlike prior works on weight-centric tensor decomposition, ESPACE projects activations onto a pre-calibrated set of principal components. The activation-centrality of the approach enables retraining LLMs with no loss of expressivity, while at inference, weight decomposition is obtained as a byproduct of matrix multiplication associativity. Theoretical results on the construction of projection matrices with optimal computational accuracy are provided. Experimentally, we find ESPACE enables 50% compression of GPT3, Llama2, and Nemotron4 models with small accuracy degradation, as low as a 0.18 perplexity increase on GPT3-22B. At lower compression rates of 20% to 40%, ESPACE drives GPT3 models to outperform their baselines, by up to a 0.38 decrease in perplexity for GPT3-8B. ESPACE also…
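The projection idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the helper `calibrate_projection`, the synthetic data, and the choice of eigenvectors of the uncentered activation autocorrelation matrix as the "principal components" are all assumptions made for illustration. The abstract does not say which construction the paper's theoretical results recommend. The sketch shows that projecting activations with a matrix `P` and multiplying by a precomputed `P.T @ W` is, by associativity, the same as using a low-rank weight `P @ (P.T @ W)`.

```python
import numpy as np


def calibrate_projection(calib_acts, k):
    """Hypothetical helper: top-k eigenvectors of the activation autocorrelation.

    calib_acts has shape (n_samples, d); the result has shape (d, k) with
    orthonormal columns. This is one plausible calibration, assumed here.
    """
    autocorr = calib_acts.T @ calib_acts / calib_acts.shape[0]
    eigvals, eigvecs = np.linalg.eigh(autocorr)  # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]               # keep the k largest


rng = np.random.default_rng(0)
n, d, m, k = 512, 64, 48, 16

# Synthetic activations that lie close to a k-dimensional subspace (assumption).
basis = rng.standard_normal((k, d))
acts = rng.standard_normal((n, k)) @ basis + 0.01 * rng.standard_normal((n, d))
W = rng.standard_normal((d, m))  # dense layer weight, output = acts @ W

P = calibrate_projection(acts, k)
W_small = P.T @ W                # (k, m), computed once offline

y_ref = acts @ W                 # original dense GEMM
y_proj = (acts @ P) @ W_small    # project activations, then a smaller GEMM
y_same = acts @ (P @ W_small)    # same result viewed as a low-rank weight

rel_err = np.linalg.norm(y_ref - y_proj) / np.linalg.norm(y_ref)

# Parameter count of the layer before and after factoring.
params_dense = d * m
params_factored = d * k + k * m
```

In this sketch the stored matrices are `P` and `W_small`, so the layer shrinks only when `d * k + k * m` is smaller than `d * m`. How the paper itself counts compressed parameters is not stated in the abstract.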

Videos

ESPACE: Dimensionality Reduction of Activations for Model Compression · slideslive

Taxonomy

Topics: Real-time simulation and control systems · Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques

Methods: Sparse Evolutionary Training