Elastic ViTs from Pretrained Models without Retraining
Walter Simoncini, Michael Dorkenwald, Tijmen Blankevoort, Cees G.M. Snoek, Yuki M. Asano

TL;DR
This paper introduces SnapViT, a fast, retraining-free structured pruning method that turns pretrained Vision Transformers into elastic models whose size can be adjusted to a range of compute budgets.
Contribution
It proposes a novel, efficient pruning strategy using evolutionary algorithms and self-supervised importance scoring for elastic inference in pretrained Vision Transformers.
Findings
Outperforms state-of-the-art pruning methods across various sparsities on DINO, SigLIPv2, DeIT, and AugReg models.
Generates elastic models in less than five minutes on a single A100 GPU.
Maintains high performance without requiring retraining or labeled data.
Abstract
Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structure correlations, approximated via an evolutionary algorithm, does not require labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeIT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to…
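To make the idea of one scoring pass serving many compute budgets concrete, here is a minimal sketch. It is not the SnapViT algorithm: it uses a plain first-order Taylor score (the sum of |weight × gradient| over a unit's parameters), a common gradient-based pruning criterion, as a stand-in, and it leaves out the cross-network structure correlations and the evolutionary search described in the abstract. The function names and the toy numbers are hypothetical.

```python
from typing import Dict, List, Sequence


def taylor_importance(
    weights: Sequence[Sequence[float]], grads: Sequence[Sequence[float]]
) -> List[float]:
    """Score each structured unit (e.g. a head or an MLP neuron) as the
    sum of |w * g| over its parameters. Stand-in criterion, not SnapViT's."""
    return [
        sum(abs(w * g) for w, g in zip(w_unit, g_unit))
        for w_unit, g_unit in zip(weights, grads)
    ]


def keep_mask(scores: Sequence[float], sparsity: float) -> List[bool]:
    """Prune the lowest-scoring `sparsity` fraction of units; keep the rest."""
    n_prune = int(round(sparsity * len(scores)))
    ascending = sorted(range(len(scores)), key=lambda i: scores[i])
    pruned = set(ascending[:n_prune])
    return [i not in pruned for i in range(len(scores))]


def elastic_masks(
    scores: Sequence[float], sparsities: Sequence[float]
) -> Dict[float, List[bool]]:
    """Reuse one set of scores to derive a mask for every requested budget."""
    return {s: keep_mask(scores, s) for s in sparsities}


# Toy example: four units with two parameters each (made-up values).
weights = [[0.5, -1.0], [0.1, 0.2], [2.0, 0.0], [-0.3, 0.4]]
grads = [[0.2, 0.1], [0.05, -0.05], [0.5, 1.0], [1.0, -1.0]]

scores = taylor_importance(weights, grads)
masks = elastic_masks(scores, [0.0, 0.25, 0.5])
print(masks[0.25])  # [True, False, True, True]
print(masks[0.5])   # [False, False, True, True]
```

Because every mask is derived from the same ranking, the units kept at a higher sparsity are always a subset of those kept at a lower one, so switching budgets needs no rescoring or retraining in this simplified setting.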
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
