Understanding the Transfer Limits of Vision Foundation Models

Shiqi Huang; Yipei Wang; Natasha Thorley; Alexander Ng; Shaheer Saeed; Mark Emberton; Shonit Punwani; Veeru Kasivisvanathan; Dean Barratt; Daniel Alexander; Yipeng Hu

arXiv:2601.15888 · cs.CV · January 23, 2026

Open Access

TL;DR

This paper investigates why vision foundation models often underperform on downstream tasks, highlighting the importance of aligning pretraining objectives with specific task requirements, demonstrated through clinical imaging applications.

Contribution

The paper provides empirical evidence that closer alignment between pretraining objectives and downstream tasks improves transfer performance and speeds convergence in vision foundation models.

Findings

01. Alignment between pretraining and downstream tasks correlates with performance gains.

02. Task-specific pretraining objectives enhance transfer learning effectiveness.

03. Faster convergence is observed with better task alignment.

Abstract

Foundation models leverage large-scale pretraining to capture extensive knowledge, demonstrating generalization in a wide range of language tasks. By comparison, vision foundation models (VFMs) often exhibit uneven improvements across downstream tasks, despite substantial computational investment. We postulate that this limitation arises from a mismatch between pretraining objectives and the demands of downstream vision-and-imaging tasks. Pretraining strategies like masked image reconstruction or contrastive learning shape representations for tasks such as recovery of generic visual patterns or global semantic structures, which may not align with the task-specific requirements of downstream applications including segmentation, classification, or image synthesis. To investigate this in a concrete real-world clinical area, we assess two VFMs, a reconstruction-focused MAE-based model…

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Neurobiology of Language and Bilingualism · Multimodal Machine Learning Applications