Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations

Bowen Shen; Zheng Lin; Daren Zha; Wei Liu; Jian Luan; Bin Wang; Weiping Wang

arXiv:2407.05690 · cs.CL · July 9, 2024 · 1 citation

TL;DR

This paper introduces TransAct, a structured pruning method that reduces large language models into intra-module low-rank architectures by selectively pruning transitional activations, leading to efficient models with minimal performance loss.

Contribution

The paper proposes a novel activation-guided structured pruning approach, TransAct, that effectively compresses LLMs by focusing on intra-module low-rank architectures while maintaining performance.

Findings

01. TransAct achieves high compression ratios with minimal accuracy loss.

02. Pruned models significantly reduce weights, KV cache, and attention computation.

03. Activation-guided iterative pruning enhances redundancy removal in MHA and MLP modules.
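
To make the KV cache finding concrete, here is a rough back-of-the-envelope sketch. It uses the standard accounting for a decoder-only Transformer: keys and values are cached per layer, per KV head, per token. It assumes that pruning shrinks the per-head dimension, which this summary does not spell out. The function name kv_cache_bytes and every number are hypothetical illustrations, not results from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading factor 2 counts the two cached tensors (keys and values).
    bytes_per_value=2 corresponds to fp16/bf16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value


# Hypothetical 32-layer model with 32 KV heads and a 4096-token context.
dense = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)

# ASSUMPTION: pruning halves the per-head dimension (128 -> 64).
pruned = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=64, seq_len=4096)

ratio = pruned / dense
print(f"dense: {dense / 2**30:.2f} GiB, pruned: {pruned / 2**30:.2f} GiB, ratio: {ratio:.2f}")
```

Under this accounting the cache scales linearly with the head dimension, so any reduction there carries over one-to-one to cache memory.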

Abstract

Structured pruning fundamentally reduces computational and memory overheads of large language models (LLMs) and offers a feasible solution for end-side LLM deployment. Structurally pruned models remain dense and high-precision, highly compatible with further tuning and compression. However, as the coarse-grained structured pruning poses large damage to the highly interconnected model, achieving a high compression ratio for scaled-up LLMs remains a challenge. In this paper, we introduce a task-agnostic structured pruning approach coupled with a compact Transformer architecture design. The proposed approach, named TransAct, reduces transitional activations inside multi-head attention (MHA) and multi-layer perceptron (MLP) modules, while preserving the inter-module activations that are sensitive to perturbations. Hence, the LLM is pruned into an intra-module low-rank architecture,…
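
The idea of shrinking a module's internal width while leaving its input and output width untouched can be sketched on a toy MLP. This is a minimal illustration, not the TransAct procedure: the scoring rule (mean absolute activation on calibration inputs), the ungated ReLU MLP, and the names prune_mlp, w_up, and w_down are all assumptions made for this example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def mlp_forward(x, w_up, w_down):
    """x: (n, d_model); w_up: (d_model, d_ff); w_down: (d_ff, d_model)."""
    return matmul(relu(matmul(x, w_up)), w_down)


def prune_mlp(x_calib, w_up, w_down, keep):
    """Keep the `keep` intermediate neurons with the largest mean |activation|.

    Only the intermediate (transitional) width d_ff shrinks; the module still
    maps d_model -> d_model, so activations passed between modules keep their size.
    """
    hidden = relu(matmul(x_calib, w_up))          # (n, d_ff)
    d_ff = len(w_down)
    scores = [sum(abs(row[j]) for row in hidden) / len(hidden) for j in range(d_ff)]
    top = sorted(range(d_ff), key=lambda j: scores[j], reverse=True)[:keep]
    kept = sorted(top)                            # preserve original neuron order
    w_up_pruned = [[row[j] for j in kept] for row in w_up]   # slice columns
    w_down_pruned = [w_down[j] for j in kept]                # slice rows
    return w_up_pruned, w_down_pruned, kept


# Toy weights: d_model = 2, d_ff = 4; neuron 3 never activates.
w_up = [[1.0, 0.0, 0.5, 0.0],
        [0.0, 1.0, 0.5, 0.0]]
w_down = [[1.0, 0.0],
          [0.0, 1.0],
          [1.0, 1.0],
          [2.0, 2.0]]
x_calib = [[1.0, 2.0],
           [3.0, 1.0]]

w_up_p, w_down_p, kept = prune_mlp(x_calib, w_up, w_down, keep=3)
print(kept)  # [0, 1, 2]
```

In this toy case the dropped neuron is always inactive, so the pruned MLP reproduces the original outputs on the calibration inputs. With real weights, removing neurons changes the output, which is why pruning is typically followed by further tuning.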

Code & Models

Repositories

sbwww/transact-pruning (Official)

Taxonomy

Topics: Topic Modeling

Methods: Attention Is All You Need · Softmax · Byte Pair Encoding · Layer Normalization · Linear Layer · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Dropout