Revisiting Model Stitching In the Foundation Model Era
Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Y. C. Chen, Lu Xia, Min Sun, Wei-Lun Chao, Cheng-Hao Kuo

TL;DR
This paper systematically investigates whether heterogeneous vision foundation models (VFMs) can be stitched together. It proposes a simple feature-matching method that enables effective stitching and introduces the VFM Stitch Tree for flexible model integration.
Contribution
The paper introduces a systematic protocol for stitching diverse VFMs, shows that simple methods can make stitching effective, and proposes the VFM Stitch Tree for practical model integration.
Findings
A feature-matching loss enables stitching of heterogeneous VFMs.
Accuracy is harder to retain at shallow stitch points.
Stitching at deep stitch points can surpass the individual models with minimal overhead.
Abstract
Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple…
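The stitching setup described in the abstract can be illustrated with a minimal sketch. It is not the paper's proposed recipe (the abstract is cut off before that is described). It shows only the generic composition "source early layers, then stitch layer, then target later layers", plus the conventional baseline of matching intermediate features at the stitch point. Toy ReLU MLPs stand in for the VFMs, the stitch layer is assumed to be a single linear map, and every function name, layer width, and hyperparameter here is hypothetical.

```python
import random

def linear(weights, x):
    """Dense layer without bias; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

def make_model(dims, rng):
    """Toy MLP: one weight matrix per layer, mapping dims[i] -> dims[i + 1]."""
    return [[[rng.uniform(-1, 1) for _ in range(dims[i])]
             for _ in range(dims[i + 1])]
            for i in range(len(dims) - 1)]

def run_layers(layers, x):
    for w in layers:
        x = relu(linear(w, x))
    return x

def stitched_forward(source, target, stitch, k, x):
    """Source layers [0, k) -> linear stitch layer -> target layers [k, end)."""
    h = run_layers(source[:k], x)
    return run_layers(target[k:], linear(stitch, h))

def feature_matching_loss(stitch, feats):
    """Mean squared error between mapped source features and target features."""
    total = 0.0
    for h, ref in feats:
        total += sum((a - b) ** 2 for a, b in zip(linear(stitch, h), ref))
    return total / len(feats)

def train_stitch(source, target, k, batch, steps=200):
    """Fit the stitch layer by gradient descent on the stitch-point loss."""
    d_in, d_out = len(source[k - 1]), len(target[k - 1])
    stitch = [[0.0] * d_in for _ in range(d_out)]
    # Both models are frozen, so their stitch-point features are computed once.
    feats = [(run_layers(source[:k], x), run_layers(target[:k], x)) for x in batch]
    # Conservative step size: 0.5 / mean squared feature norm keeps each
    # gradient step from increasing this quadratic loss.
    mean_sq = sum(sum(v * v for v in h) for h, _ in feats) / len(feats)
    lr = 0.5 / max(mean_sq, 1e-12)
    for _ in range(steps):
        grad = [[0.0] * d_in for _ in range(d_out)]
        for h, ref in feats:
            mapped = linear(stitch, h)
            for i in range(d_out):
                err = 2.0 * (mapped[i] - ref[i]) / len(feats)
                for j in range(d_in):
                    grad[i][j] += err * h[j]
        for i in range(d_out):
            for j in range(d_in):
                stitch[i][j] -= lr * grad[i][j]
    return stitch, feats

rng = random.Random(0)
source = make_model([4, 6, 5, 3], rng)   # hypothetical "source" model
target = make_model([4, 8, 7, 3], rng)   # hypothetical "target" model, different widths
batch = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(32)]
k = 2                                     # stitch point: after two layers

stitch, feats = train_stitch(source, target, k, batch)
zero_stitch = [[0.0] * len(source[k - 1]) for _ in range(len(target[k - 1]))]
loss_before = feature_matching_loss(zero_stitch, feats)
loss_after = feature_matching_loss(stitch, feats)
out = stitched_forward(source, target, stitch, k, batch[0])
```

Only the stitch layer is trained here; both models stay frozen. Per the abstract, this stitch-point matching is one of the conventional approaches that struggles to retain accuracy on heterogeneous VFMs, especially at shallow stitch points, so the sketch should be read as the baseline setup rather than the remedy.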
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Face Recognition and Perception
