Stitched ViTs are Flexible Vision Backbones
Zizheng Pan, Jing Liu, Haoyu He, Jianfei Cai, Bohan Zhuang

TL;DR
SN-Netv2 introduces an improved model stitching framework for vision Transformers, enabling flexible, efficient, and high-performing backbones adaptable to diverse downstream tasks and performance-efficiency trade-offs.
Contribution
The paper proposes SN-Netv2, a model stitching framework that extends SN-Net with a two-way stitching scheme and resource-constrained sampling, improving the flexibility and efficiency of pretrained ViTs when they are adapted to downstream tasks.
Findings
Outperforms SN-Netv1 on dense prediction tasks
Achieves better performance-efficiency trade-offs
Demonstrates strong adaptability as a flexible backbone
Abstract
Large pretrained plain vision Transformers (ViTs) have been the workhorse for many downstream tasks. However, existing works utilizing off-the-shelf ViTs are inefficient in terms of training and deployment, because adopting ViTs with individual sizes requires separate trainings and is restricted by fixed performance-efficiency trade-offs. In this paper, we are inspired by stitchable neural networks (SN-Net), which is a new framework that cheaply produces a single model that covers rich subnetworks by stitching pretrained model families, supporting diverse performance-efficiency trade-offs at runtime. Building upon this foundation, we introduce SN-Netv2, a systematically improved model stitching framework to facilitate downstream task adaptation. Specifically, we first propose a two-way stitching scheme to enlarge the stitching space. We then design a resource-constrained sampling…
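The idea of a stitching space that supports runtime trade-offs can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "two-way" means a stitch may run in either direction between a smaller and a larger model of equal depth, that per-block cost is uniform within each model, and that cost can stand in as a rough proxy for accuracy when choosing a stitch. The names `Stitch`, `enumerate_stitches`, and `pick_under_budget`, and all cost values, are hypothetical. The learned stitching layer that maps activations between the two models is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stitch:
    first: str    # model that runs the early blocks
    second: str   # model that runs the remaining blocks
    split: int    # number of blocks taken from `first`
    cost: float   # hypothetical compute cost, arbitrary units


def enumerate_stitches(depth: int, cost_small: float, cost_large: float) -> List[Stitch]:
    """List stitches between two equal-depth models in both directions.

    Assumption: each block of a model costs the same fraction of that
    model's total cost.
    """
    per_block = {"small": cost_small / depth, "large": cost_large / depth}
    stitches = []
    # Two directions: small-to-large and large-to-small.
    for first, second in (("small", "large"), ("large", "small")):
        for split in range(1, depth):
            cost = split * per_block[first] + (depth - split) * per_block[second]
            stitches.append(Stitch(first, second, split, cost))
    return stitches


def pick_under_budget(stitches: List[Stitch], budget: float) -> Optional[Stitch]:
    """Return the costliest stitch that fits the budget, or None.

    Assumption: higher cost is used as a stand-in for higher accuracy.
    """
    feasible = [s for s in stitches if s.cost <= budget]
    return max(feasible, key=lambda s: s.cost) if feasible else None


if __name__ == "__main__":
    space = enumerate_stitches(depth=12, cost_small=12.0, cost_large=48.0)
    print(len(space), "stitches")
    print(pick_under_budget(space, budget=30.0))
```

In this toy setting, allowing both directions doubles the number of candidate stitches compared with a single direction, and a deployment budget selects one of them at runtime without retraining.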
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Robotics and Sensor-Based Localization
