ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

Yifan Li; Xin Li; Tianqin Li; Wenbin He; Yu Kong; Liu Ren

arXiv:2506.03433 · cs.CV · July 22, 2025

TL;DR

ViT-Split leverages vision foundation models (VFMs) by splitting their layers into a low-level feature extractor and a task-specific adapter, reducing training time by up to 4x while matching or improving on the performance of existing adapters.

Contribution

The paper proposes ViT-Split, an approach that divides VFM layers into an extractor for low-level features and task-specific heads. This removes the CNN branch used by earlier adapters and reduces the number of components that need tuning.

Findings

01. Reduces training time by up to 4x on various tasks.

02. Achieves results comparable to or better than existing adapters.

03. Effectively leverages the prior knowledge in VFMs.

Abstract

Vision foundation models (VFMs) have demonstrated remarkable performance across a wide range of downstream tasks. While several VFM adapters have shown promising results by leveraging the prior knowledge of VFMs, we identify two inefficiencies in these approaches. First, the interaction between the convolutional neural network (CNN) and the VFM backbone triggers gradient backpropagation through the early layers. Second, existing methods require tuning all components, adding complexity. In addition, these adapters alter VFM features, underutilizing the prior knowledge. To tackle these challenges, we propose a new approach called ViT-Split, based on a key observation: the layers of several VFMs, such as DINOv2, can be divided into two distinct components: an extractor for learning low-level features and an adapter for learning task-specific features. Leveraging this insight, we eliminate the CNN branch and introduce…
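
The extractor/adapter split described in the abstract can be illustrated with a small, framework-free sketch. It is not the authors' implementation. It assumes the extractor layers are kept frozen (the abstract implies this by criticizing early-layer backpropagation, but the truncated text does not state it), and the names `Layer`, `split_backbone`, `split_at`, and `layers_in_backward_pass`, as well as the 12-layer depth and the split point of 8, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Layer:
    """Stand-in for one transformer block; it only records whether it is trained."""
    index: int
    trainable: bool = True


def split_backbone(num_layers: int, split_at: int) -> Tuple[List[Layer], List[Layer]]:
    """Split a layer stack into a frozen extractor and a trainable adapter.

    `split_at` is a hypothetical hyperparameter: layers [0, split_at) form the
    extractor and layers [split_at, num_layers) form the adapter.
    """
    if not 0 <= split_at <= num_layers:
        raise ValueError("split_at must lie in [0, num_layers]")
    extractor = [Layer(i, trainable=False) for i in range(split_at)]
    adapter = [Layer(i, trainable=True) for i in range(split_at, num_layers)]
    return extractor, adapter


def layers_in_backward_pass(layers: List[Layer]) -> int:
    """Count the layers gradients must traverse.

    Backpropagation has to reach the earliest trainable layer, so every layer
    from that position to the output takes part in the backward pass.
    """
    for position, layer in enumerate(layers):
        if layer.trainable:
            return len(layers) - position
    return 0


# Assumed sizes for illustration only: a 12-layer backbone split after layer 8.
extractor, adapter = split_backbone(num_layers=12, split_at=8)
print(layers_in_backward_pass(extractor + adapter))  # 4

# Tuning every layer forces gradients through the whole stack.
print(layers_in_backward_pass([Layer(i) for i in range(12)]))  # 12
```

Under these assumptions, only the adapter layers take part in the backward pass, which is one plausible reading of how removing early-layer backpropagation could shorten training.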

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications