Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models

Tai An; Ruwu Cai; Yanzhe Zhang; Yang Liu; Hao Chen; Pengcheng Xie; Sheng Chang; Yiwu Yao; Gongyi Wang

arXiv:2508.02128·cs.LG·August 5, 2025

Open Access

TL;DR

Amber Pruner is a training-free N:M activation sparsity method that accelerates the prefill stage of large language model inference while maintaining downstream task performance.

Contribution

The paper presents Amber Pruner, a novel training-free N:M activation sparsity technique for LLM prefill, and introduces Outstanding-sparse, a framework combining sparsity with quantization for improved efficiency.

Findings

01. Sparsifies more than 55% of linear computations without retraining.

02. Accelerates prefill in multiple LLMs across 2:4, 4:8, and 8:16 sparsity ratios.

03. Maintains strong downstream task performance, especially in generative tasks.

Abstract

In the era of large language models (LLMs), N:M sparsity has emerged as a structured compression technique critical for accelerating inference. While prior work has primarily focused on weight sparsity, it often suffers from significant accuracy degradation. Activation sparsity, though promising, is typically training-dependent and faces challenges in generalization. To address these limitations, we introduce Amber Pruner, a training-free N:M activation sparsity method designed specifically for the prefill stage, targeting the acceleration of linear projection layers in LLMs. Extensive experiments across multiple models and sparsity ratios (2:4, 4:8, and 8:16) demonstrate that Amber Pruner can effectively sparsify and accelerate more than 55% of linear computations without requiring model retraining. To further enhance generality and efficiency, we propose Outstanding-sparse, a unified…
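To make the N:M pattern concrete, the sketch below zeroes all but N entries in every group of M consecutive activation values. The selection rule used here (keep the largest magnitudes) is an assumption chosen for illustration, since the abstract does not state Amber Pruner's actual criterion, and the function name `nm_sparsify` and the sample values are hypothetical.

```python
from typing import List


def nm_sparsify(row: List[float], n: int, m: int) -> List[float]:
    """Keep n entries in each group of m consecutive values; zero the rest.

    Assumption: entries are ranked by absolute value. This is a generic
    magnitude rule, not necessarily the criterion used by Amber Pruner.
    """
    if not 0 < n <= m:
        raise ValueError("require 0 < n <= m")
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")

    out: List[float] = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Rank positions by descending magnitude; ties go to the lower index.
        ranked = sorted(range(m), key=lambda i: (-abs(group[i]), i))
        keep = set(ranked[:n])
        out.extend(x if i in keep else 0.0 for i, x in enumerate(group))
    return out


# Hypothetical activation row, pruned with the 2:4 pattern.
activations = [0.1, -2.0, 0.5, 3.0, 1.0, -0.2, 0.0, 0.4]
sparse = nm_sparsify(activations, n=2, m=4)
print(sparse)  # [0.0, -2.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.4]
```

The same function covers the 4:8 and 8:16 ratios by changing `n` and `m`. All three ratios zero half of the values; the larger groups give more freedom in which positions survive.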

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications