Stitched ViTs are Flexible Vision Backbones

Zizheng Pan; Jing Liu; Haoyu He; Jianfei Cai; Bohan Zhuang

arXiv:2307.00154 · cs.CV · November 29, 2023

TL;DR

SN-Netv2 introduces an improved model stitching framework for vision Transformers, enabling flexible, efficient, and high-performing backbones adaptable to diverse downstream tasks and performance-efficiency trade-offs.

Contribution

The paper proposes SN-Netv2, a model stitching framework that adds a two-way stitching scheme and resource-constrained sampling, making pretrained ViTs more flexible and efficient to adapt to downstream tasks.

Findings

01. Outperforms SN-Netv1 on dense prediction tasks

02. Achieves better performance-efficiency trade-offs

03. Demonstrates strong adaptability as a flexible backbone

Abstract

Large pretrained plain vision Transformers (ViTs) have been the workhorse for many downstream tasks. However, existing works utilizing off-the-shelf ViTs are inefficient in terms of training and deployment, because adopting ViTs with individual sizes requires separate trainings and is restricted by fixed performance-efficiency trade-offs. In this paper, we are inspired by stitchable neural networks (SN-Net), which is a new framework that cheaply produces a single model that covers rich subnetworks by stitching pretrained model families, supporting diverse performance-efficiency trade-offs at runtime. Building upon this foundation, we introduce SN-Netv2, a systematically improved model stitching framework to facilitate downstream task adaptation. Specifically, we first propose a two-way stitching scheme to enlarge the stitching space. We then design a resource-constrained sampling…
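To make the idea of a "stitching space" concrete, the following is a minimal sketch, not the authors' implementation. It assumes two hypothetical models from one family with 4 and 6 Transformer blocks, treats a stitch as "leave one model after some block, enter the other at some block", and allows every pair of positions. The function names, the depths, and the unconstrained pairing are all assumptions made for illustration; the actual framework may restrict which positions can be paired.

```python
from itertools import product


def one_way_stitches(depth_small, depth_large):
    """Enumerate small-to-large stitches as (direction, exit_block, entry_block).

    A stitch runs the first `exit_block` blocks of the small model, maps the
    activations through a stitching layer, and continues from block
    `entry_block` of the large model. (Illustrative assumption: any pair of
    interior positions is allowed.)
    """
    return [
        ("small->large", i, j)
        for i, j in product(range(1, depth_small), range(1, depth_large))
    ]


def two_way_stitches(depth_small, depth_large):
    """Add the reverse direction (large-to-small) to the one-way space."""
    reverse = [
        ("large->small", j, i)
        for j, i in product(range(1, depth_large), range(1, depth_small))
    ]
    return one_way_stitches(depth_small, depth_large) + reverse


def blocks_executed(stitch, depth_small, depth_large):
    """Crude cost proxy: how many Transformer blocks a stitch runs in total.

    This ignores that blocks of the larger model cost more than blocks of
    the smaller one.
    """
    direction, exit_block, entry_block = stitch
    target_depth = depth_large if direction == "small->large" else depth_small
    return exit_block + (target_depth - entry_block)


if __name__ == "__main__":
    small, large = 4, 6  # hypothetical depths, chosen only for the example
    print(len(one_way_stitches(small, large)))
    print(len(two_way_stitches(small, large)))
```

Under these assumptions, allowing both directions doubles the number of candidate subnetworks, and a per-stitch cost proxy such as `blocks_executed` is the kind of quantity a resource-aware selection could use to favor stitches within a given budget.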

Code & Models

Repositories

ziplab/sn-netv2
PyTorch · Official

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Robotics and Sensor-Based Localization