DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs
Ruokai Yin, Yuhang Li, Donghyun Lee, Priyadarshini Panda

TL;DR
DuoGPT introduces a training-free dual sparsity framework for large language models that combines weight pruning with activation sparsity, improving efficiency while maintaining accuracy.
Contribution
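DuoGPT extends the Optimal Brain Compression (OBC) framework with activation-aware calibration and output residuals from the dense model as correction terms, and optimizes the resulting solution for GPU execution so that dual-sparse workloads scale to large LLMs.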
Findings
Outperforms state-of-the-art structured pruning methods by up to 9.17% accuracy at iso-speedup on LLaMA-2 and LLaMA-3
Achieves 1.39× speedup over dense models
Scalable to billion-parameter LLMs
Abstract
Large language models (LLMs) deliver strong performance but are difficult to deploy due to high memory and compute costs. While pruning reduces these demands, most methods ignore activation sparsity observed at runtime. We reinterpret activation sparsity as dynamic structured weight sparsity and propose DuoGPT, a unified framework that constructs dual-sparse (spMspV) workloads by combining unstructured weight pruning with activation sparsity. To preserve accuracy, we extend the Optimal Brain Compression (OBC) framework with activation-aware calibration and introduce output residuals from the dense model as correction terms. We further optimize the solution for efficient GPU execution, enabling scalability to billion-parameter LLMs. Evaluations on LLaMA-2 and LLaMA-3 show that DuoGPT outperforms state-of-the-art structured pruning methods by up to 9.17% accuracy at an iso-speedup of…
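The abstract's reading of activation sparsity as dynamic structured weight sparsity can be illustrated with a toy matrix-vector product. The sketch below is not the authors' implementation or their GPU kernel; the names `dense_matvec` and `dual_sparse_matvec`, the example matrix, and the column-skipping convention (computing y = W x, so a zero in x makes the matching column of W irrelevant) are assumptions made for illustration only.

```python
# Toy illustration (assumed names and data, not the DuoGPT code):
# computing y = W x where W has unstructured zeros (pruned weights)
# and x has zeros that appear only at runtime (sparse activations).

def dense_matvec(W, x):
    """Reference product that touches every weight."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def dual_sparse_matvec(W, x):
    """Skip columns whose activation is zero, then skip pruned weights.

    Returns the output vector and the number of multiply-accumulates
    actually performed.
    """
    # A zero activation x[j] makes column j of W irrelevant for this input:
    # this is the "dynamic structured weight sparsity" view.
    active = [j for j, xj in enumerate(x) if xj != 0.0]
    y = []
    macs = 0
    for row in W:
        acc = 0.0
        for j in active:
            w = row[j]
            if w != 0.0:  # unstructured weight sparsity from pruning
                acc += w * x[j]
                macs += 1
        y.append(acc)
    return y, macs

# Hypothetical 3x4 pruned weight matrix and a sparse activation vector.
W = [
    [0.5, 0.0, -1.0, 2.0],
    [0.0, 1.5, 0.0, -0.5],
    [3.0, 0.0, 0.0, 1.0],
]
x = [2.0, 0.0, 0.0, 4.0]

y_dense = dense_matvec(W, x)
y_sparse, macs = dual_sparse_matvec(W, x)
dense_macs = len(W) * len(x)
```

In this toy case both routines return the same output, while the dual-sparse version performs 5 multiply-accumulates instead of the 12 a dense product needs. Which columns are skipped depends on the input, which is why the resulting structure is dynamic rather than fixed at pruning time.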
Taxonomy
Topics: Data Stream Mining Techniques · Mobile Crowdsensing and Crowdsourcing · Reinforcement Learning in Robotics
Methods: Pruning
