MoDeGPT: Modular Decomposition for Large Language Model Compression

Chi-Heng Lin; Shangqian Gao; James Seale Smith; Abhishek Patel; Shikhar Tuli; Yilin Shen; Hongxia Jin; Yen-Chang Hsu

arXiv:2408.09632 · cs.LG · May 5, 2025

Open Access · 3 Reviews

TL;DR

MoDeGPT introduces a structured, module-based compression method for large language models that avoids fine-tuning, significantly reduces computational costs, and maintains high performance at substantial compression rates.

Contribution

It presents a novel decomposition framework for LLM compression that does not require gradient-based fine-tuning, using matrix decomposition algorithms applied to transformer modules.
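To make the idea of compressing a transformer module without gradient updates more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's algorithm: it treats an MLP as a pair of matrices sharing a hidden dimension, scores each hidden unit on a small calibration batch, and keeps the same `k` units in both matrices. The function name `prune_mlp_pair`, the ReLU nonlinearity, the random calibration data, and the scoring heuristic are all hypothetical choices made for this example.

```python
import numpy as np


def mlp(X, W_U, W_D):
    """Two-matrix MLP module: ReLU(X @ W_U) @ W_D (ReLU is a stand-in nonlinearity)."""
    return np.maximum(X @ W_U, 0.0) @ W_D


def prune_mlp_pair(W_U, W_D, X, k):
    """Keep k hidden units of the pair (W_U, W_D) using calibration inputs X.

    Hypothetical heuristic, not the paper's method: score each hidden unit by
    its activation energy times the squared norm of its outgoing weights.
    """
    H = np.maximum(X @ W_U, 0.0)                      # (n, h) hidden activations
    scores = (H ** 2).sum(axis=0) * (W_D ** 2).sum(axis=1)
    keep = np.sort(np.argsort(scores)[-k:])           # indices of the k best units
    # Selecting the same indices in both matrices shrinks the shared hidden
    # dimension from h to k without adding any extra parameters.
    return W_U[:, keep], W_D[keep, :], keep


rng = np.random.default_rng(0)
d, h, n, k = 16, 64, 256, 48                          # toy sizes, 25% of units removed
W_U = rng.normal(size=(d, h))
W_D = rng.normal(size=(h, d))
X = rng.normal(size=(n, d))                           # stand-in calibration batch

W_U_k, W_D_k, keep = prune_mlp_pair(W_U, W_D, X, k)
full = mlp(X, W_U, W_D)
rel_err = np.linalg.norm(full - mlp(X, W_U_k, W_D_k)) / np.linalg.norm(full)
print(f"kept {k}/{h} hidden units, relative output error {rel_err:.3f}")
```

Because the selection acts on the module's output rather than on each matrix in isolation, the error being measured is the one that matters downstream. A real method would replace the heuristic score with a principled decomposition.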

Findings

01

Achieves 98% compute savings in compressing a 13B model.

02

Maintains 90-95% zero-shot performance at 25-30% compression.

03

Increases inference throughput by up to 46%.

Abstract

Large Language Models (LLMs) have reshaped the landscape of artificial intelligence by demonstrating exceptional performance across various tasks. However, substantial computational requirements make their deployment challenging on devices with limited resources. Recently, compression methods using low-rank matrix techniques have shown promise, yet these often lead to degraded accuracy or introduce significant overhead in parameters and inference latency. This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework that does not need recovery fine-tuning while resolving the above drawbacks. MoDeGPT partitions the Transformer block into modules composed of matrix pairs and reduces the hidden dimensions via reconstructing the module-level outputs. MoDeGPT is developed based on a theoretical framework that utilizes three…

Peer Reviews

Decision · ICLR 2025 Oral

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. The authors propose the interesting idea of using three different matrix decomposition algorithms to compress computations in both MLP and Attention.

2. Experimental results demonstrate that the proposed method offers advantages in terms of both performance and efficiency compared to prior pruning and matrix decomposition algorithms.

3. The Appendix includes additional methods and experiments related to group-query attention.

Weaknesses

1. The authors suggest using three different types of matrix decompositions for three different types of computations within Transformers, but they do not provide motivation for this choice. For example, why is CR decomposition more suitable for Type-2 computation?

Reviewer 02 · Rating 8 · Confidence 2

Strengths

To the best of my knowledge, the method of structured approximations across multiple matrices is novel and the results are strong. For the most part, the paper is well-written.

Weaknesses

One weakness is the lack of justification for the approximation methods for each weight group. Could you give more intuition behind why each method was chosen? For example, the sentence "Since $W_U$ is inside a nonlinear function $\sigma_s$, we constrain the search space for its approximation to a matrix multiplication $W_U S_k$ for tractability, where $S_k$ is the $k$-column selection matrix" (line 244) only describes the approximation, whereas a justification would explain why Nystrom is a bet

Reviewer 03 · Rating 8 · Confidence 4

Strengths

This paper has diverse strengths and I summarize them as follows:

### Method

1. The authors introduce Nystrom approximation, CR decomposition, and SVD to prune row-column pairs in LLMs. To the best of my knowledge, this is the first work to use Nystrom approximation and CR decomposition to prune LLMs. The authors carefully use them to prune different types of modules.

2. The authors propose a novel global sparsity allocation algorithm with entropic regularization. If this algorithm contribut

Weaknesses

### Method

1. In the caption of Figure 1, the authors insist that their new pruning structure avoids the need for extra adapters. However, SliceGPT's adapters are introduced to improve accuracy and can be removed for inference without (dimensional) errors. Therefore, that statement should be modified.

2. The main contribution of this paper is introducing diverse decomposition algorithms and applying them to the proper modules. However, there is a lack of explanations of the characteristics of

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax