Revisiting Multi-Task Visual Representation Learning

Shangzhe Di; Zhonghua Zhai; Weidi Xie

arXiv:2601.13886·cs.CV·January 21, 2026

Open Access

TL;DR

This paper introduces MTV, a multi-task visual pretraining framework that combines vision-language, self-supervised, and dense spatial objectives, leveraging pseudo-labels from expert models to improve spatial reasoning and semantic understanding.

Contribution

The paper presents a multi-task learning framework that integrates multiple training objectives with pseudo-labels, and it systematically analyzes how those objectives interact and scale in order to learn better visual representations.

Findings

01. MTV achieves state-of-the-art spatial reasoning performance.

02. Multi-task learning enhances both local and global visual understanding.

03. Pseudo-labels from high-capacity models effectively scale supervision.

Abstract

Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To mitigate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the…
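To make the idea of jointly optimizing one backbone across three objective families more concrete, here is a minimal NumPy sketch of how such losses might be combined into a single scalar. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss forms or their weighting, so the CLIP-style symmetric contrastive loss, the MAE-style masked reconstruction loss, the L1 depth loss against expert pseudo-labels, all function names, and the weights are hypothetical.

```python
import numpy as np

def _log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Assumed vision-language objective: symmetric InfoNCE over a batch,
    # where row i of img_emb is paired with row i of txt_emb.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(len(logits))
    loss_i2t = -_log_softmax(logits)[idx, idx].mean()
    loss_t2i = -_log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def masked_reconstruction_loss(pred, target, mask):
    # Assumed self-supervised objective: mean squared error per patch,
    # averaged over masked patches only (mask == 1 marks a masked patch).
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return (per_patch * mask).sum() / max(mask.sum(), 1)

def dense_depth_loss(pred_depth, pseudo_depth):
    # Assumed dense spatial objective: L1 distance to a depth map
    # produced by an expert model (a pseudo-label, not a manual label).
    return np.abs(pred_depth - pseudo_depth).mean()

def multi_task_loss(losses, weights):
    # Weighted sum of per-task losses; the weights are hypothetical.
    return sum(weights[name] * value for name, value in losses.items())
```

In a real training loop, the three inputs would all be computed from features of the same shared backbone, so that gradients from every objective update the same parameters; how the tasks are balanced is exactly the kind of interaction the paper says it investigates.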

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications