Adaptive MLP Pruning for Large Vision Transformers
Chengchao Shen

TL;DR
This paper introduces an adaptive pruning method for large vision transformers that significantly reduces parameters and computational costs while maintaining performance, by evaluating neuron importance with a novel label-free criterion.
Contribution
The proposed AMP method adaptively prunes MLP modules in vision transformers using a new importance evaluation, outperforming existing pruning techniques without needing predefined compression ratios.
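One way to avoid a predefined compression ratio is to search for the largest pruning ratio whose metric shift stays under a tolerance, in the spirit of the binary search the reviews below mention. The following is a minimal sketch under stated assumptions, not the paper's procedure: `search_pruning_ratio`, `metric_shift`, `tol`, and `steps` are hypothetical names, and the shift is assumed to grow monotonically with the pruning ratio.

```python
from typing import Callable


def search_pruning_ratio(
    metric_shift: Callable[[float], float],
    tol: float,
    steps: int = 20,
) -> float:
    """Binary-search the largest ratio in [0, 1] with metric_shift(ratio) < tol.

    metric_shift is a hypothetical callback: it would prune the model at the
    given ratio and return how far a monitored quantity moved from its
    unpruned value. It is assumed to be non-decreasing in the ratio.
    """
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if metric_shift(mid) < tol:
            lo = mid  # shift still acceptable: try pruning more
        else:
            hi = mid  # shift too large: prune less
    return lo


if __name__ == "__main__":
    # Toy stand-in for a real model: the shift grows quadratically with the ratio.
    ratio = search_pruning_ratio(lambda r: r * r, tol=0.16)
    print(f"selected pruning ratio: {ratio:.4f}")
```

In practice each call to `metric_shift` would require a forward pass over a calibration batch, so the number of steps trades search precision against cost.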
Findings
Achieves roughly 40% reduction in parameters and FLOPs.
Maintains near-original performance after pruning.
Outperforms other pruning methods without fine-tuning.
Abstract
Large vision transformers present impressive scalability, as their performance can be well improved with increased model capacity. Nevertheless, their cumbersome parameters result in exorbitant computational and memory demands. By analyzing prevalent transformer structures, we find that multilayer perceptron (MLP) modules constitute the largest share of the model's parameters. In this paper, we propose an Adaptive MLP Pruning (AMP) method to substantially reduce the parameters of large vision transformers without obvious performance degradation. First, we adopt a Taylor-based method to evaluate the importance of MLP neurons. However, computing importance with a one-hot cross-entropy loss ignores the potential predictions on other categories, thus degrading the quality of the evaluated importance scores. To address this issue, we introduce a label-free information entropy criterion to fully…
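The Taylor-based scoring described in the abstract can be illustrated with a small NumPy sketch. It is an illustration under assumptions rather than the paper's implementation: the exact entropy criterion is not given here, so the entropy of softmax outputs stands in as a label-free loss, ReLU stands in for the model's activation, and `entropy_taylor_importance`, `x`, `w1`, and `w2` are hypothetical names.

```python
import numpy as np


def entropy_taylor_importance(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """First-order Taylor importance of each hidden MLP neuron.

    Uses a label-free loss: the entropy of the softmax over the MLP output.
    The score for neuron j is |mean over the batch of h_j * dH/dh_j|.
    """
    h = np.maximum(x @ w1, 0.0)                 # hidden activations, (batch, hidden)
    z = h @ w2                                  # outputs, (batch, out)
    z = z - z.max(axis=1, keepdims=True)        # stabilise the softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    logp = np.log(p + 1e-12)
    ent = -(p * logp).sum(axis=1, keepdims=True)  # per-sample entropy H
    grad_z = -p * (logp + ent)                  # dH/dz for softmax entropy
    grad_h = grad_z @ w2.T                      # dH/dh by the chain rule
    return np.abs((h * grad_h).mean(axis=0))    # one score per hidden neuron


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 4))
    w1 = rng.normal(size=(4, 6))
    w2 = rng.normal(size=(6, 3))
    scores = entropy_taylor_importance(x, w1, w2)
    print("neurons ranked from least to most important:", np.argsort(scores))
```

No labels enter the computation, which is the property the abstract emphasizes; neurons with the smallest scores would be the first candidates for removal.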
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
The paper is well written, and the presentation of the idea is clear.
The novelty seems limited: the core idea is to replace the standard cross-entropy-based gradient used in the Taylor-based importance score with a label-agnostic information entropy, and the rest of the method, the binary search, then becomes quite obvious. The proposed method therefore seems simple and straightforward, which in turn requires comprehensive experiments to justify its generalizability. However, the experimental discussion is not very comprehensive.
This paper is clearly written and well organized.
I have concerns from the following perspectives: - The proposed criterion essentially measures the sample-wise feature similarity among a batch of images. Larger entropy will mean more diverse output images. The reviewer is not sure why using feature similarity is a good criterion. How would pruning affect this information entropy value? Will it go up or down? Based on the formula $\epsilon_t - \epsilon_0 < \delta \epsilon$, it seems to indicate that the information entropy value will go up, m
1. The experimental evaluation is extensive, spanning several large-scale vision transformers and diverse tasks, demonstrating consistent compression and performance trends.
2. The ablation studies are well organized and effectively disentangle the roles of the entropy criterion, pruning strategy, and threshold parameters.
3. The presentation is clear overall, with a logical flow and visual aids that make the method and procedure easy to follow.
4. The proposed approach is label-free and inde
1. The evaluation is dominated by CLIP-style models, with only one experiment on DINOv2, leaving limited evidence of generalization beyond contrastive vision frameworks.
2. The work lacks experiments on standard supervised classification models, which makes it unclear how the method performs under typical vision transformer training settings.
3. The evaluation is confined to vision tasks, while recent pruning research increasingly focuses on large language models, limiting the relevance to bro
Taxonomy
Topics: Advanced Memory and Neural Computing · Advanced Neural Network Applications · Image Enhancement Techniques
