Adaptive Pruning of Pretrained Transformer via Differential Inclusions
Yizhuo Ding, Ke Fan, Yikai Wang, Xinwei Sun, Yanwei Fu

TL;DR
This paper introduces SPP, a flexible transformer pruning method that generates a range of sparsity levels in a single stage using differential inclusions, reducing computational costs and enabling customizable model compression.
Contribution
The paper presents a novel single-stage pruning approach for transformers using differential inclusions, allowing for multiple sparsity levels without multiple pruning processes.
Findings
SPP effectively produces various sparsity levels in one process.
The method maintains model performance across different compression ratios.
Extensive experiments validate the approach on multiple transformer architectures.
Abstract
Large transformers have demonstrated remarkable success, making it necessary to compress these models to reduce inference costs while preserving their performance. Current compression algorithms prune transformers at fixed compression ratios, requiring a unique pruning process for each ratio, which results in high computational costs. In contrast, we propose pruning of pretrained transformers at any desired ratio within a single pruning stage, based on a differential inclusion for a mask parameter. This dynamic can generate the whole regularization solution path of the mask parameter, whose support set identifies the network structure. Therefore, the solution path identifies a Transformer weight family with various sparsity levels, offering greater flexibility and customization. In this paper, we introduce such an effective pruning method, termed SPP (Solution Path Pruning). To achieve…
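To make the idea of a "regularization solution path" concrete, here is a minimal sketch on a toy sparse regression problem. It is not the authors' SPP implementation. It assumes a linearized Bregman iteration, a standard discretization of this kind of differential inclusion and the basis of the DessiLBI method mentioned in the reviews. The names `lbi_path` and `first_with_support`, the data, and the values of `kappa` and `n_steps` are all hypothetical choices for illustration.

```python
import numpy as np

def soft_threshold(z, lam=1.0):
    """Shrink each entry of z toward zero by lam (proximal map of the L1 norm)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lbi_path(X, y, kappa=10.0, n_steps=2000):
    """Linearized Bregman iteration for least squares (illustrative sketch).

    z accumulates negative gradients; a coordinate of w becomes nonzero
    only once |z| exceeds 1, so the support tends to grow along the path.
    """
    n, p = X.shape
    # Largest eigenvalue of X^T X / n; the step size keeps alpha * kappa * spectral = 1.
    spectral = np.linalg.norm(X, 2) ** 2 / n
    alpha = 1.0 / (kappa * spectral)
    z = np.zeros(p)
    w = np.zeros(p)
    path = []
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y) / n
        z = z - alpha * grad
        w = kappa * soft_threshold(z)
        path.append(w.copy())
    return np.array(path)

def first_with_support(path, k):
    """Return the first iterate on the path with at least k nonzero entries."""
    for w in path:
        if np.count_nonzero(w) >= k:
            return w
    return path[-1]

# Toy data (assumed for illustration): 3 relevant features out of 20.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.01 * rng.standard_normal(200)

path = lbi_path(X, y)
support_sizes = np.count_nonzero(path, axis=1)
```

A single run stores every iterate, so a model at a chosen sparsity level can be read off afterwards, for example with `first_with_support(path, 2)`, instead of rerunning the optimization for each target ratio.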
Peer Reviews
Decision: ICLR 2025 Poster
1. This paper is well motivated. A method that supports any desired pruning ratio is needed in many real-world applications, and the proposed method achieves this.
2. This paper identifies the limitation of Lasso and develops a differential inclusion-based method to prune at various compression ratios.
3. There is a sound theoretical analysis to guarantee the global convergence of the method. A detailed proof is included in the appendix.
4. Experimental results are strong. Many experime
1. Although the authors claim that the proposed method significantly reduces the cost of model pruning, the training cost is not reported in this paper. It would be better to state how long the search stage takes and to compare training cost across methods.
2. The ablation studies are weak. Experiments demonstrate the strong performance of the proposed SPP, but it is hard for the reader to figure out why the proposed method is effective. More ablation studies are needed. For exampl
**Adaptive Solution Path for Pruning** - Unlike traditional mask-based pruning, SPP generates models with different sparsity levels in a single pruning run, allowing for a Transformer Weight Family adaptable to varying hardware or performance needs without retraining. **Flexible, Fine-Grained Pruning Strategy** - SPP’s pair-wise shared mask strategy applies pruning at the smallest functional units within transformers (e.g., query-key and value-output pairs), allowing for greater flexi
**Marginal Improvement Over Existing Methods** - Although the method introduces adaptive pruning, it does not fundamentally change the mask-based pruning paradigm. The improvements, while novel in terms of execution, may appear incremental compared to existing mask-based and structural pruning strategies. **Lack of Broad Comparison with Other Mask-Based Methods** - The paper does not provide an in-depth comparison with other advanced mask-based pruning techniques, making it difficult
1. Clear motivation. 2. The theoretical proof and experiment are quite sufficient.
1. The results of the ablation studies are insufficient to demonstrate the effectiveness of the proposed method, for the following reasons: (a) For the DeiT-Small and Swin-Tiny models, the proposed SPP achieves higher accuracy but with more parameters and higher or equal FLOPs, which does not indicate that SPP is superior. (b) Conducting experiments solely with the DessiLBI method lacks generalizability.
2. Figure 1 and the tables lack legends and comments.
3. "FLOPs" is generally not written as "Flops" when used
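One review above mentions a pair-wise shared mask over query-key and value-output pairs. The excerpt does not say how SPP parametrizes that mask, so the following is only a sketch under an assumption: a single binary mask over head dimensions is applied to both the query and key projections. The shapes, the mask values, and the helper `qk_scores` are hypothetical. The sketch shows why sharing the mask matters: masked dimensions contribute nothing to the attention scores, so they can be physically removed from both matrices without changing the result.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 8

X = rng.standard_normal((n_tokens, d_model))    # token embeddings
W_q = rng.standard_normal((d_model, d_head))    # query projection
W_k = rng.standard_normal((d_model, d_head))    # key projection

# Hypothetical binary mask over head dimensions, shared by W_q and W_k.
mask = np.array([1, 0, 1, 1, 0, 0, 1, 1], dtype=float)

def qk_scores(X, W_q, W_k):
    """Unnormalized attention scores (X W_q)(X W_k)^T."""
    return (X @ W_q) @ (X @ W_k).T

# Option 1: keep full-size matrices but zero out the masked columns.
masked_scores = qk_scores(X, W_q * mask, W_k * mask)

# Option 2: drop the masked columns from both matrices.
keep = mask.astype(bool)
W_q_small, W_k_small = W_q[:, keep], W_k[:, keep]
pruned_scores = qk_scores(X, W_q_small, W_k_small)
```

Because the same columns are removed from both projections, `masked_scores` and `pruned_scores` agree, and the smaller matrices can replace the originals at inference time.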
Taxonomy
Topics: Power Quality and Harmonics · Magnetic Properties and Applications
Methods: Attention Is All You Need · Absolute Position Encodings · Softmax · Linear Layer · Adam · Residual Connection · Dropout · Sparse Evolutionary Training · Multi-Head Attention · Position-Wise Feed-Forward Layer
