Parameter Reduction Improves Vision Transformers: A Comparative Study of Sharing and Width Reduction

Anantha Padmanaban Krishna Kumar (Boston University)

arXiv:2512.01059·cs.CV·December 2, 2025

TL;DR

This study shows that removing parameters from the MLP blocks of a Vision Transformer, either by sharing weights or by reducing width, can improve accuracy and training stability, and in the width-reduced case also inference throughput, challenging the assumption that larger models are always better.

Contribution

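The paper introduces two parameter-reduction strategies for the MLP blocks of ViT-B/16: GroupedMLP, which shares MLP weights between adjacent transformer blocks, and ShallowMLP, which halves the MLP hidden dimension. Each removes 32.7% of the baseline parameters while improving accuracy and training stability.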

Findings

01

GroupedMLP achieves 81.47% top-1 accuracy with 32.7% fewer parameters than the baseline.

02

ShallowMLP increases inference throughput by 38%.

03

Both variants outperform the 86.6M-parameter baseline (81.05% top-1) in accuracy and train more stably.
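
Stability in this summary refers to how far accuracy falls from its best value to its value at the end of training (the peak-to-final degradation). A minimal sketch of that measurement follows; the function name and the accuracy curve are hypothetical placeholders for illustration, not values from the paper.

```python
def peak_to_final_degradation(accuracies):
    """Gap between the best accuracy seen during training and the final one.

    `accuracies` is a per-epoch list of validation accuracies in percent.
    A value of 0.0 means the final checkpoint is also the best checkpoint.
    """
    if not accuracies:
        raise ValueError("need at least one accuracy value")
    return max(accuracies) - accuracies[-1]


# Hypothetical toy curve (NOT from the paper): accuracy peaks, then slips.
toy_curve = [70.0, 75.0, 74.5]
print(peak_to_final_degradation(toy_curve))
```

A smaller gap means the model kept most of its best accuracy through the end of training, which is the sense in which both reduced variants are reported as more stable.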

Abstract

Although scaling laws and many empirical results suggest that increasing the size of Vision Transformers often improves performance, model accuracy and training behavior are not always monotonically increasing with scale. Focusing on ViT-B/16 trained on ImageNet-1K, we study two simple parameter-reduction strategies applied to the MLP blocks, each removing 32.7% of the baseline parameters. Our GroupedMLP variant shares MLP weights between adjacent transformer blocks and achieves 81.47% top-1 accuracy while maintaining the baseline computational cost. Our ShallowMLP variant halves the MLP hidden dimension and reaches 81.25% top-1 accuracy with a 38% increase in inference throughput. Both models outperform the 86.6M-parameter baseline (81.05%) and exhibit substantially improved training stability, reducing peak-to-final accuracy degradation from 0.47% to the range…
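
The 32.7% figure can be checked with a short parameter count. The sketch below assumes the standard ViT-B/16 configuration (12 blocks, embedding width 768, MLP hidden width 3072, two linear layers with biases per MLP) and assumes GroupedMLP shares one MLP per pair of adjacent blocks; none of these details are spelled out in the abstract, so treat them as assumptions rather than the author's exact setup.

```python
# Assumed standard ViT-B/16 dimensions (not stated in the abstract).
DEPTH = 12          # transformer blocks
WIDTH = 768         # embedding dimension
HIDDEN = 3072       # MLP hidden dimension (4 x WIDTH)
BASELINE_PARAMS = 86.6e6  # total baseline parameters, from the abstract


def mlp_params(width, hidden):
    """Parameters in one MLP block: two linear layers, each with a bias."""
    return (width * hidden + hidden) + (hidden * width + width)


baseline_mlp = DEPTH * mlp_params(WIDTH, HIDDEN)

# Assumption: adjacent blocks share in pairs, so 6 distinct MLPs remain.
grouped_mlp = (DEPTH // 2) * mlp_params(WIDTH, HIDDEN)

# ShallowMLP keeps 12 distinct MLPs but halves the hidden dimension.
shallow_mlp = DEPTH * mlp_params(WIDTH, HIDDEN // 2)


def percent_removed(reduced_mlp):
    """Share of the whole model's parameters removed by an MLP variant."""
    return 100 * (baseline_mlp - reduced_mlp) / BASELINE_PARAMS


print(f"GroupedMLP removes {percent_removed(grouped_mlp):.1f}%")  # 32.7%
print(f"ShallowMLP removes {percent_removed(shallow_mlp):.1f}%")  # 32.7%
```

Under these assumptions both variants remove about 28.3M parameters, which matches the reported 32.7%. The count also suggests why the two behave differently at inference: sharing leaves every block running a full-width MLP, so compute is unchanged, whereas halving the hidden dimension shrinks the matrix multiplications themselves.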

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors