Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations
Bowen Shen, Zheng Lin, Daren Zha, Wei Liu, Jian Luan, Bin Wang, Weiping Wang

TL;DR
This paper introduces TransAct, a structured pruning method that prunes large language models into intra-module low-rank architectures by reducing transitional activations, yielding efficient models with minimal performance loss.
Contribution
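The paper proposes TransAct, an activation-guided structured pruning approach that compresses LLMs by reducing the transitional activations inside MHA and MLP modules while preserving the perturbation-sensitive inter-module activations, so the pruned model keeps an intra-module low-rank architecture and its performance.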
Findings
TransAct achieves high compression ratios with minimal accuracy loss.
Pruned models significantly reduce weights, KV cache, and attention computation.
Activation-guided iterative pruning enhances redundancy removal in MHA and MLP modules.
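To make the KV cache finding concrete, here is a rough arithmetic sketch. It assumes the cache shrinks in proportion to the per-head dimension kept after pruning, which is an illustrative assumption rather than a detail stated in this summary. The function name kv_cache_bytes and the model sizes (32 layers, 32 heads, head dimension 128, 4096 tokens, 2 bytes per value) are hypothetical and are not results from the paper.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    """Standard KV cache size for multi-head attention.

    The leading factor of 2 counts one key tensor and one value tensor
    per layer; every other factor is a tensor dimension.
    """
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
full = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=4096)

# Assumption: pruning halves the per-head dimension inside MHA.
pruned = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=64, seq_len=4096)

print(full / 2**30, pruned / 2**30)  # sizes in GiB
```

Under these assumptions the cache scales linearly with the retained head dimension, so halving that dimension halves the cache.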
Abstract
Structured pruning fundamentally reduces computational and memory overheads of large language models (LLMs) and offers a feasible solution for end-side LLM deployment. Structurally pruned models remain dense and high-precision, highly compatible with further tuning and compression. However, as the coarse-grained structured pruning poses large damage to the highly interconnected model, achieving a high compression ratio for scaled-up LLMs remains a challenge. In this paper, we introduce a task-agnostic structured pruning approach coupled with a compact Transformer architecture design. The proposed approach, named TransAct, reduces transitional activations inside multi-head attention (MHA) and multi-layer perceptron (MLP) modules, while preserving the inter-module activations that are sensitive to perturbations. Hence, the LLM is pruned into an intra-module low-rank architecture,…
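The idea of shrinking a transitional activation while leaving inter-module activations untouched can be sketched for an MLP module as follows. This is a minimal NumPy illustration, not the authors' implementation: the plain two-layer ReLU MLP, the mean-absolute-activation score, and the names prune_mlp_intermediate and mlp_forward are all assumptions made for the example.

```python
import numpy as np


def mlp_forward(x, w_up, w_down):
    # Simplified MLP: the hidden state is the "transitional" activation;
    # the input and output live in the inter-module (d_model) space.
    return np.maximum(x @ w_up, 0.0) @ w_down


def prune_mlp_intermediate(w_up, w_down, calib_x, keep):
    """Keep the `keep` hidden channels with the largest activations.

    w_up: (d_model, d_ff), w_down: (d_ff, d_model), calib_x: (n, d_model).
    The scoring rule is a hypothetical stand-in for the paper's criterion.
    """
    hidden = np.maximum(calib_x @ w_up, 0.0)
    score = np.abs(hidden).mean(axis=0)          # one score per hidden channel
    kept = np.sort(np.argsort(score)[-keep:])    # indices of retained channels
    # Slice columns of w_up and the matching rows of w_down, so d_model
    # (the inter-module width) is unchanged and only d_ff shrinks.
    return w_up[:, kept], w_down[kept, :], kept
```

Because only the hidden width is reduced, the pruned module still accepts and returns vectors of the original model width, which is what lets the rest of the network stay unchanged in this sketch.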
Taxonomy
Topics: Topic Modeling
Methods: Attention Is All You Need · Softmax · Byte Pair Encoding · Layer Normalization · Linear Layer · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Dropout
