Revisiting Multi-Task Visual Representation Learning
Shangzhe Di, Zhonghua Zhai, Weidi Xie

TL;DR
This paper introduces MTV, a multi-task visual pretraining framework that combines vision-language, self-supervised, and dense spatial objectives, leveraging pseudo-labels from expert models to improve spatial reasoning and semantic understanding.
Contribution
The paper presents a multi-task pretraining framework that trains a shared backbone on vision-language contrastive, self-supervised, and dense spatial objectives, using pseudo-labels from expert models for the dense supervision, and systematically analyzes how these objectives interact and scale.
Findings
MTV achieves state-of-the-art spatial reasoning performance.
Multi-task learning enhances both local and global visual understanding.
Pseudo-labels from high-capacity models effectively scale supervision.
Abstract
Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To mitigate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the…
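As a rough illustration of what "jointly optimizes a shared backbone" across several objectives can look like, the sketch below combines three per-task losses into one weighted scalar. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`info_nce`, `mse_reconstruction`, `l1_depth_loss`, `multi_task_loss`), the specific loss forms (symmetric InfoNCE, mean squared error, L1 against depth pseudo-labels), the temperature, and the task weights are all hypothetical choices, since this summary does not specify them.

```python
import math


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images matches pair i of texts.

    Assumed stand-in for the vision-language objective.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine-similarity logits scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed)
    )


def mse_reconstruction(pred_patches, target_patches):
    """Mean squared error; assumed stand-in for the self-supervised objective."""
    return sum((p - t) ** 2 for p, t in zip(pred_patches, target_patches)) / len(
        pred_patches
    )


def l1_depth_loss(pred_depth, pseudo_depth):
    """Mean absolute error against pseudo-labels from an expert depth model."""
    return sum(abs(p - q) for p, q in zip(pred_depth, pseudo_depth)) / len(pred_depth)


def multi_task_loss(losses, weights):
    """Weighted sum of per-task losses; the weights are hypothetical."""
    return sum(weights[name] * value for name, value in losses.items())


# Toy inputs standing in for outputs of task heads on one shared backbone.
losses = {
    "contrastive": info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
    "self_supervised": mse_reconstruction([0.5, 0.5], [0.0, 1.0]),
    "depth": l1_depth_loss([1.0, 2.0], [1.0, 4.0]),
}
weights = {"contrastive": 1.0, "self_supervised": 1.0, "depth": 0.5}
total = multi_task_loss(losses, weights)
```

In a real training loop the single scalar `total` would be backpropagated through all task heads into the shared backbone, which is what lets the objectives shape one representation; how the weights are actually set is something this sketch leaves open.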
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
