LaRoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation
Kai Liu, Bowen Xu, Shaoyu Wu, Xin Chen, Hao Zhou, Yongliang Tao, Lulu Hu

TL;DR
LaRoSA applies layerwise orthogonal rotations to LLM input activations and sparsifies the rotated activations with Top-K selection, accelerating inference with minimal performance loss and without additional training or magnitude-based pruning.
Contribution
The paper proposes LaRoSA, an activation sparsification method that combines layerwise orthogonal rotations with Top-K selection to achieve consistent model-level sparsity and reliable wall-clock speed-up in LLMs without extra training.
Findings
Achieves a 1.30x wall-clock speed-up at 40% sparsity for LLaMA2-7B.
Maintains a 0.17 perplexity gap relative to the dense model.
Reduces the zero-shot task accuracy gap to 0.54%.
Abstract
Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is…
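The rotate-then-Top-K idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how each layer's rotation is obtained, so a random orthogonal matrix from a QR decomposition stands in for it here, and the function names (`random_orthogonal`, `topk_sparsify`, `rotated_sparse_linear`) are hypothetical. The sketch relies only on the identity x Wᵀ = (xQ)(WQ)ᵀ for orthogonal Q, which lets the rotation be folded into the weight ahead of time.

```python
import numpy as np

def random_orthogonal(dim, seed=0):
    """Stand-in rotation (assumption): a random orthogonal matrix from QR.
    The source does not state how LaRoSA derives its layerwise rotations."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def topk_sparsify(x, k):
    """Keep the k largest-magnitude entries in each row and zero the rest."""
    idx = np.argpartition(np.abs(x), -k, axis=-1)[..., -k:]
    mask = np.zeros_like(x, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return np.where(mask, x, 0.0)

def rotated_sparse_linear(x, weight, q, sparsity):
    """Approximate x @ weight.T using Top-K sparsified rotated activations.

    x: (tokens, dim), weight: (out, dim), q: (dim, dim) orthogonal.
    Returns the layer output and the sparsified rotated activations.
    """
    dim = x.shape[-1]
    k = max(1, int(round(dim * (1.0 - sparsity))))  # entries kept per token
    x_rot = x @ q        # rotate the input activations
    w_rot = weight @ q   # fold the same rotation into the weight (offline)
    x_sparse = topk_sparsify(x_rot, k)
    return x_sparse @ w_rot.T, x_sparse

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 16))
    w = rng.standard_normal((8, 16))
    q = random_orthogonal(16)
    y, x_sparse = rotated_sparse_linear(x, w, q, sparsity=0.5)
    print("nonzeros per token:", np.count_nonzero(x_sparse, axis=-1))
    print("max abs error vs dense:", np.abs(y - x @ w.T).max())
```

Because Top-K keeps a fixed number of entries per token, the sparsity level is identical for every input, unlike a magnitude threshold whose sparsity varies with the activation distribution. With `sparsity=0.0` the sketch reproduces the dense output up to floating-point error, since the rotation itself is lossless.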
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
