Post-Training Statistical Calibration for Higher Activation Sparsity
Vui Seng Chua, Yujie Pan, Nilesh Jain

TL;DR
SCAP is a post-training activation pruning method that raises activation sparsity and decoding speed in large language models by calibrating activation distributions before pruning, and it applies across a range of Transformer architectures without retraining.
Contribution
Introduces SCAP, a post-training framework that generalizes activation sparsification to the input activations of fully-connected layers in Transformers and adds a Mode-Centering step to calibrate activation distributions, improving decoding speed without retraining.
Findings
Achieves an additional 1.5x LLM decoding speedup over CATS at the same model quality.
Verified empirically across recent Transformer decoders, MoE, Mamba2, encoder Transformers, and pre-quantized models.
Shows robust Pareto efficiency compared to prior methods, with the breadth of supported models indicating practicality and scalability.
Abstract
We present Statistical Calibrated Activation Pruning (SCAP), a post-training activation pruning framework that (1) generalizes sparsification by input activations of Fully-Connected layers for generic and flexible application across Transformers, and (2) features a simple Mode-Centering technique to pre-calibrate activation distributions for maximizing post-training sparsity. Our results demonstrate robust Pareto efficiency compared to prior methods, translating to a 1.5x additional LLM decoding speedup against CATS at iso model quality. SCAP effectiveness is empirically verified across a wide range of models, including recent Transformer Decoders, MoE, Mamba2, Encoding Transformer, and pre-quantized models, highlighting its practicality and scalability. The code is available at: https://github.com/IntelLabs/SCAP.
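The two ideas in the abstract, pruning the input activations of a fully-connected layer and centering them on their mode first, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The histogram-peak mode estimator, the quantile-based threshold, and the function names estimate_mode, calibrate, and sparse_fc are all assumptions made for illustration. The only fact relied on is the identity W x + b = W (x - m) + (W m + b), which lets a constant shift m be folded into the bias so that small values of x - m can be zeroed.

```python
import numpy as np

def estimate_mode(x, bins=256):
    # Assumed estimator: center of the most populated histogram bin.
    counts, edges = np.histogram(x, bins=bins)
    k = np.argmax(counts)
    return 0.5 * (edges[k] + edges[k + 1])

def calibrate(x_calib, target_sparsity):
    # x_calib: calibration activations, shape (n_samples, d_in).
    # Assumed rule: pick the magnitude threshold as a quantile of |x - mode|.
    mode = estimate_mode(x_calib.ravel())
    threshold = np.quantile(np.abs(x_calib - mode), target_sparsity)
    return mode, threshold

def sparse_fc(x, W, b, mode, threshold):
    # Fully-connected layer y = x @ W.T + b with mode-centered input pruning.
    # W has shape (d_out, d_in); b has shape (d_out,).
    centered = x - mode
    mask = np.abs(centered) > threshold        # keep only large centered inputs
    pruned = np.where(mask, centered, 0.0)
    folded_bias = b + mode * W.sum(axis=1)     # equals b + W @ (mode * ones)
    sparsity = 1.0 - mask.mean()               # fraction of inputs zeroed
    return pruned @ W.T + folded_bias, sparsity

# Usage on synthetic activations whose peak sits away from zero.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=(1000, 16))
W = rng.normal(size=(8, 16))
b = rng.normal(size=8)

mode, threshold = calibrate(x, target_sparsity=0.5)
y_sparse, sparsity = sparse_fc(x, W, b, mode, threshold)
y_dense = x @ W.T + b
print("estimated mode:", mode)
print("input sparsity:", sparsity)
print("mean abs error vs dense:", np.abs(y_sparse - y_dense).mean())
```

In this sketch the zeroed entries are the ones a sparse kernel could skip. Centering matters when the activation distribution peaks away from zero, because thresholding the raw values around zero would then miss the densest region.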
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · Advanced Measurement and Metrology Techniques · Advanced X-ray and CT Imaging
Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing
