Adaptive MLP Pruning for Large Vision Transformers

Chengchao Shen

arXiv:2603.08100·cs.CV·March 10, 2026

TL;DR

This paper introduces an adaptive pruning method for large vision transformers that cuts parameters and FLOPs by roughly 40% while maintaining near-original performance, by evaluating neuron importance with a novel label-free criterion.

Contribution

The proposed AMP method adaptively prunes MLP modules in vision transformers using a new importance evaluation, outperforming existing pruning techniques without needing predefined compression ratios.
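One way to read "without needing predefined compression ratios" is as a search over how many neurons to remove, stopping where a quality measure starts to degrade. The sketch below is a minimal illustration of that idea under stated assumptions, not the paper's procedure: `max_prunable`, `entropy_after_pruning`, `tolerance`, and `toy_entropy` are hypothetical names, and the code assumes the measured entropy rises monotonically as more of the least important neurons are removed.

```python
from typing import Callable


def max_prunable(num_neurons: int,
                 entropy_after_pruning: Callable[[int], float],
                 tolerance: float) -> int:
    """Return the largest k such that pruning the k least important neurons
    raises the entropy by less than `tolerance`.

    Assumption (not from the paper): the entropy increase is monotone in k,
    which is what makes binary search valid here.
    """
    baseline = entropy_after_pruning(0)  # entropy of the unpruned model
    lo, hi = 0, num_neurons  # invariant: pruning `lo` neurons is acceptable
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if entropy_after_pruning(mid) - baseline < tolerance:
            lo = mid       # still within tolerance, try pruning more
        else:
            hi = mid - 1   # too much degradation, prune fewer
    return lo


def toy_entropy(k: int) -> float:
    """Toy stand-in for evaluating a real pruned model on a calibration batch."""
    return 1.0 + (k / 100) ** 2


k = max_prunable(100, toy_entropy, tolerance=0.25)
print(k)
```

In a real setting `entropy_after_pruning` would run the pruned model on unlabeled calibration data, so each probe is expensive; binary search needs only about log2(num_neurons) probes per layer.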

Findings

01

Achieves roughly 40% reduction in parameters and FLOPs.

02

Maintains near-original performance after pruning.

03

Outperforms other pruning methods without fine-tuning.
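To put the roughly 40% figure in context, a back-of-the-envelope count for one standard transformer block helps. The sketch below assumes a textbook ViT block (four `dim x dim` attention projections, a two-layer MLP with expansion ratio 4) and ignores biases, layer norms, and embeddings; the models in the paper may differ, and `block_params` and `reduction_after_pruning` are hypothetical helper names.

```python
def block_params(dim: int, mlp_ratio: float = 4.0) -> dict:
    """Weight counts for one textbook ViT block (biases and norms ignored)."""
    hidden = int(dim * mlp_ratio)
    attn = 4 * dim * dim    # query, key, value, and output projections
    mlp = 2 * dim * hidden  # two linear layers: dim -> hidden -> dim
    return {"attn": attn, "mlp": mlp}


def reduction_after_pruning(dim: int, keep_fraction: float,
                            mlp_ratio: float = 4.0) -> float:
    """Fraction of block weights removed when only `keep_fraction` of the
    MLP hidden neurons are kept and attention is left untouched."""
    p = block_params(dim, mlp_ratio)
    kept_hidden = int(dim * mlp_ratio * keep_fraction)
    pruned_mlp = 2 * dim * kept_hidden
    total = p["attn"] + p["mlp"]
    return 1.0 - (p["attn"] + pruned_mlp) / total


p = block_params(1024)
mlp_share = p["mlp"] / (p["attn"] + p["mlp"])
reduction = reduction_after_pruning(1024, keep_fraction=0.4)
print(f"MLP share of block weights: {mlp_share:.3f}")
print(f"Reduction when keeping 40% of hidden neurons: {reduction:.3f}")
```

Under these assumptions the MLP holds about two thirds of a block's weights, so removing about 60% of the hidden neurons would account for a block-level reduction near 40%. This is only an arithmetic illustration; the source does not state which fraction of neurons is actually removed.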

Abstract

Large vision transformers present impressive scalability, as their performance improves substantially with increased model capacity. Nevertheless, their large parameter counts result in exorbitant computational and memory demands. By analyzing prevalent transformer structures, we find that multilayer perceptron (MLP) modules constitute the largest share of the model's parameters. In this paper, we propose an Adaptive MLP Pruning (AMP) method to substantially reduce the parameters of large vision transformers without obvious performance degradation. First, we adopt a Taylor-based method to evaluate the importance of MLP neurons. However, computing importance with a one-hot cross-entropy loss ignores the potential predictions on other categories, thus degrading the quality of the evaluated importance scores. To address this issue, we introduce a label-free information entropy criterion to fully…
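The idea of a Taylor-based importance score driven by a label-free entropy can be sketched on a toy model. First-order Taylor importance scores a neuron by |h_i · ∂L/∂h_i|, where h_i is its activation. In the sketch below, L is the entropy of softmax predictions from a tiny two-layer classifier; this is an assumed stand-in, since the abstract is truncated and one review describes the paper's actual criterion as based on sample-wise feature similarity within a batch. The function names and toy weights are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def entropy_taylor_importance(batch, w1, w2):
    """Per-neuron score: sum over samples of |h_i * dH/dh_i|, where H is the
    entropy of the softmax output. No labels are used anywhere.

    w1: hidden x input weights, w2: classes x hidden weights (toy MLP, no biases).
    """
    n_hidden = len(w1)
    scores = [0.0] * n_hidden
    for x in batch:
        # Forward pass: ReLU hidden activations, logits, class probabilities.
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
        z = [sum(w * hi for w, hi in zip(row, h)) for row in w2]
        p = softmax(z)
        ent = -sum(pk * math.log(pk) for pk in p if pk > 0)
        # Analytic gradient of entropy w.r.t. logits: dH/dz_k = -p_k (log p_k + H).
        dz = [-pk * (math.log(pk) + ent) if pk > 0 else 0.0 for pk in p]
        # Backpropagate to hidden activations: dH/dh_i = sum_k w2[k][i] * dH/dz_k.
        for i in range(n_hidden):
            dh = sum(w2[k][i] * dz[k] for k in range(len(w2)))
            scores[i] += abs(h[i] * dh)
    return scores


# Toy weights: the third hidden neuron has zero outgoing weights,
# so it cannot influence the output and should score zero.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w2 = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0]]
batch = [[1.0, 0.5], [0.2, 0.8]]

scores = entropy_taylor_importance(batch, w1, w2)
print(scores)
```

Neurons with the lowest scores would be the first candidates for removal. In practice the gradients would come from an autodiff framework rather than the hand-derived formula used here.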

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The paper is well written, and the presentation of the idea is clear.

Weaknesses

The novelty seems limited: the core idea is to replace the standard cross-entropy-based gradient used in the Taylor-based importance score with an information entropy that is label-agnostic, after which the rest of the method, a binary search, becomes quite obvious. The proposed method therefore seems simple and straightforward, which means comprehensive experiments are needed to justify its generalizability. However, the experimental discussion is not very comprehensive.

Reviewer 02 · Rating 2 · Confidence 5

Strengths

This paper is clearly written and well organized.

Weaknesses

I have the following concerns: The proposed criterion essentially measures the sample-wise feature similarity among a batch of images. Larger entropy means more diverse output images. The reviewer is not sure why feature similarity is a good criterion. How would pruning affect this information entropy value? Will it go up or down? Based on the formula $\epsilon_t - \epsilon_0 < \delta \epsilon$, it seems to indicate that the information entropy value will go up, m

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The experimental evaluation is extensive, spanning several large-scale vision transformers and diverse tasks, demonstrating consistent compression and performance trends.
2. The ablation studies are well organized and effectively disentangle the roles of the entropy criterion, pruning strategy, and threshold parameters.
3. The presentation is clear overall, with a logical flow and visual aids that make the method and procedure easy to follow.
4. The proposed approach is label-free and inde

Weaknesses

1. The evaluation is dominated by CLIP-style models, with only one experiment on DINOv2, leaving limited evidence of generalization beyond contrastive vision frameworks.
2. The work lacks experiments on standard supervised classification models, which makes it unclear how the method performs under typical vision transformer training settings.
3. The evaluation is confined to vision tasks, while recent pruning research increasingly focuses on large language models, limiting the relevance to bro

Taxonomy

Topics: Advanced Memory and Neural Computing · Advanced Neural Network Applications · Image Enhancement Techniques