FASP: Fast and Accurate Structured Pruning of Large Language Models
Hanyu Hu, Pengxiang Zhao, Ping Li, Yi Zheng, Zhefeng Wang, Xiaoming Yuan

TL;DR
FASP is a novel structured pruning method for large language models that achieves fast, accurate compression by interlinking layers and efficiently selecting components to prune, significantly reducing model size and inference time.
Contribution
Introduces FASP, a new structured pruning framework that links layers and uses an efficient pruning metric to accelerate large language model compression.
Findings
FASP prunes models like OPT-125M in 17 seconds.
FASP preserves accuracy on downstream tasks and keeps perplexity low.
FASP achieves significant speed-ups on large models like LLaMA-30B.
Abstract
The rapid increase in the size of large language models (LLMs) has significantly escalated their computational and memory demands, posing challenges for efficient deployment, especially on resource-constrained devices. Structured pruning has emerged as an effective model compression method that can reduce these demands while preserving performance. In this paper, we introduce FASP (Fast and Accurate Structured Pruning), a novel structured pruning framework for LLMs that emphasizes both speed and accuracy. FASP employs a distinctive pruning structure that interlinks sequential layers, allowing for the removal of columns in one layer while simultaneously eliminating corresponding rows in the preceding layer without incurring additional performance loss. The pruning metric, inspired by Wanda, is computationally efficient and effectively selects components to prune. Additionally, we propose…
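The coupled pruning structure and the column score described in the abstract can be sketched on a toy two-layer MLP. This is a minimal illustration, not the authors' implementation: the helper names (`column_scores`, `prune_coupled`), the toy weights, and the exact form of the score (summing Wanda-style |W_ij| * ||X_j||_2 terms over each column) are assumptions, and pure-Python lists stand in for tensors.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def hidden(W1, b1, x):
    """Hidden activations h = relu(W1 @ x + b1)."""
    return relu([a + b for a, b in zip(matvec(W1, x), b1)])

def forward(W1, b1, W2, x):
    """Two-layer MLP: y = W2 @ relu(W1 @ x + b1)."""
    return matvec(W2, hidden(W1, b1, x))

def column_scores(W2, H):
    """Assumed Wanda-style score for column j of W2:
    sum_i |W2[i][j]| * ||H[:, j]||_2 over the calibration activations H."""
    n_hidden = len(W2[0])
    norms = [math.sqrt(sum(h[j] ** 2 for h in H)) for j in range(n_hidden)]
    return [sum(abs(row[j]) for row in W2) * norms[j] for j in range(n_hidden)]

def prune_coupled(W1, b1, W2, scores, n_remove):
    """Drop the n_remove lowest-scoring columns of W2 together with the
    matching rows of W1 (and entries of b1) in the preceding layer."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    drop = set(order[:n_remove])
    keep = [j for j in range(len(scores)) if j not in drop]
    W1_p = [W1[j] for j in keep]
    b1_p = [b1[j] for j in keep]
    W2_p = [[row[j] for j in keep] for row in W2]
    return W1_p, b1_p, W2_p, keep

# Toy weights and calibration inputs (hypothetical values).
W1 = [[1.0, -2.0], [0.5, 0.5], [0.0, 0.1], [-1.0, 1.0]]   # 4 hidden x 2 inputs
b1 = [0.0, 0.1, 0.0, -0.2]
W2 = [[1.0, 0.5, 0.01, -1.0], [0.0, 2.0, 0.02, 0.5]]      # 2 outputs x 4 hidden
X = [[1.0, 0.5], [-0.5, 1.0], [0.2, -0.3]]

H = [hidden(W1, b1, x) for x in X]
scores = column_scores(W2, H)
W1_p, b1_p, W2_p, keep = prune_coupled(W1, b1, W2, scores, n_remove=1)

# Reference: the original network with the dropped columns of W2 zeroed.
W2_masked = [[w if j in keep else 0.0 for j, w in enumerate(row)] for row in W2]
for x in X:
    y_pruned = forward(W1_p, b1_p, W2_p, x)
    y_masked = forward(W1, b1, W2_masked, x)
    # Removing the matching rows of W1 adds no error beyond the column removal.
    assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(y_pruned, y_masked))
```

Because a removed hidden unit no longer feeds any retained weight, the smaller network reproduces the masked network exactly in this sketch. The only error is the one introduced by discarding the column itself.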
Peer Reviews
Decision · ICLR 2025 Conference Withdrawn Submission
- The paper is clearly written, with well-organized sections detailing the methodology, experiment setup, and results.
- FASP aims at achieving efficiency on NVIDIA RTX 4090, making the method more accessible for practical applications.
- While FASP offers practical improvements, its core ideas rely heavily on existing pruning strategies, such as those proposed by Wanda and similar structured pruning frameworks. The novelty primarily lies in the integration of these techniques, which does not constitute sufficient methodological novelty.
- Some recent related work is missing. For example, DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models.
- The experiment did not include the comparison with many exist
1. Effective Weight Restoration Mechanism: To counteract potential accuracy drops from pruning, FASP introduces a restoration mechanism that optimizes the remaining weights, preserving model fidelity. This component reinforces the pruning approach, ensuring that accuracy and perplexity remain competitive across various levels of sparsity, particularly valuable in maintaining generalization for downstream tasks.
2. Relevance to Real-World Applications: With its focus on deployment practicality, FA
1. Lack of Memory Savings Analysis: The paper primarily reports on pruning speed and latency improvements, omitting a detailed analysis of memory savings. Given that structured pruning is often motivated by reductions in memory footprint, a quantitative comparison with other pruning methods on memory consumption would strengthen the paper’s practical impact claims. Including this metric would provide additional insight into FASP’s viability for memory-constrained deployments.
2. Missed Opportuni
1. The paper is well written and well motivated; I can fully understand the technical details.
2. The column-row corresponding pruning is clever.
3. The experiments are extensive, with enough support for the main claims.
1. **More perplexity results, like on C4.** The comparison of perplexity on WikiText might not be fair. FASP restores the pruned weights on 128 samples from WikiText, which adapts it to WikiText. This makes the comparison unfair to methods that do not have this restoration step. I also observe slight overfitting in Table 2, where FASP at 10% sparsity achieves a lower perplexity than the original LLM (12.42 vs. 12.47). Therefore, I encourage the authors to include more perplexity results, like on C4 (bu
1. The paper is generally well-written, with experiments across different types of models and tasks, showing interesting results. 2. Leveraging the inherent position mapping in matrix multiplication to reduce substructures from rows and columns in the weight matrices is a good idea.
1. The proposed pruning structure does not seem to be universal and needs to be specifically designed for different models (such as OPT and LLaMA mentioned in this paper).
2. Figures 1 and 2 are oversimplified to summarize the characteristics of the proposed method.
3. In terms of Tables 1, 3, and 4, the proposed FASP has only a slight improvement in pruning time and accuracy compared to FLAP.
4. It is inappropriate to use the same title for Tables 3, 5, and 6, which have different purposes.
5. On
FASP, the proposed method, is simple and easy to understand. Despite its simplicity, FASP performs well: it achieves the highest accuracy (or the lowest perplexity) in almost all settings while requiring the shortest time for pruning. Combined with the fact that the small model pruned with FASP shows a 16% inference speedup, this simple method can be considered useful in practical settings.
### Novelty
1. The main weakness of this paper lies in its novelty. This paper proposes (1) formulation, (2) importance metric, and (3) restoration method, but all of these ideas can be found in previous works [1, 2, 3, 4] with slight modifications in some cases. In detail, the pruning of neurons is used in [1,2,3] and the importance metric of FASP is a straightforward modification of Wanda [4]; it just sums up the importance score of weights in each column to measure the importance of the colum
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pruning · LLaMA · OPT
