VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization

Yikun Liu; Yuan Liu; Shangzhe Di; Haicheng Wang; Zhongyin Zhao; Le Tian; Xiao Zhou; Jie Zhou; Jiangchao Yao; Yanfeng Wang; Weidi Xie

arXiv:2602.09934 · cs.CV · February 11, 2026

TL;DR

VersaViT introduces a multi-task framework to enhance vision encoders in multimodal large language models, enabling them to perform well on both high-level reasoning and dense vision tasks.

Contribution

The paper proposes VersaViT, a novel multi-task post-training method that improves vision backbones for diverse vision tasks within MLLMs.

Findings

01. Improved performance on dense prediction tasks.
02. Versatile backbone suitable for reasoning and pixel-level understanding.
03. Effective multi-task optimization framework.

Abstract

Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in visual-language understanding, demonstrating superior high-level semantic alignment within their vision encoders. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision-centric tasks as well? To address the question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training. This framework facilitates the optimization of the vision backbone via lightweight task heads with…
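The idea of optimizing one shared vision backbone through several lightweight task heads can be illustrated with a toy sketch. This is not the VersaViT implementation: the scalar "backbone", the two linear heads named `seg` and `depth`, the equal loss weights, and the finite-difference training loop are all assumptions chosen to keep the example dependency-free. It only shows that a weighted sum of per-task losses sends gradient from every head into the same backbone parameter.

```python
# Toy sketch of multi-task post-training with a shared backbone.
# Every name and number here is a hypothetical illustration, not taken
# from the paper or from the tencent/VersaViT release.

def backbone(patches, w):
    """Shared encoder stand-in: one dense feature per input patch."""
    return [w * p for p in patches]

def linear_head(features, scale):
    """Lightweight task head: a single per-patch linear map."""
    return [scale * f for f in features]

def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def multitask_loss(params, patches, targets, weights):
    """Weighted sum of per-task losses computed on shared features."""
    feats = backbone(patches, params["backbone"])
    total = 0.0
    for task, target in targets.items():
        pred = linear_head(feats, params[task])
        total += weights[task] * mse(pred, target)
    return total

def grad(params, key, patches, targets, weights, eps=1e-6):
    """Central finite-difference gradient of the joint loss for one parameter."""
    up = dict(params)
    up[key] += eps
    down = dict(params)
    down[key] -= eps
    diff = (multitask_loss(up, patches, targets, weights)
            - multitask_loss(down, patches, targets, weights))
    return diff / (2 * eps)

def train(params, patches, targets, weights, lr=0.02, steps=500):
    """Plain gradient descent on the backbone and all heads jointly."""
    params = dict(params)  # leave the caller's dict untouched
    for _ in range(steps):
        grads = {k: grad(params, k, patches, targets, weights) for k in params}
        for k in params:
            params[k] -= lr * grads[k]
    return params

# Synthetic per-patch targets for two dense tasks (assumed for illustration).
patches = [0.5, 1.0, 1.5, 2.0]
targets = {"seg": [1.0, 2.0, 3.0, 4.0], "depth": [0.25, 0.5, 0.75, 1.0]}
weights = {"seg": 1.0, "depth": 1.0}
init = {"backbone": 0.5, "seg": 0.5, "depth": 0.5}

before = multitask_loss(init, patches, targets, weights)
trained = train(init, patches, targets, weights)
after = multitask_loss(trained, patches, targets, weights)
print(f"joint loss before: {before:.4f}, after: {after:.4f}")
```

In a real setting the backbone would be a vision transformer and the heads small task-specific modules; how the paper weights or schedules the tasks is not stated in the truncated abstract above.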

Code & Models

Models

tencent/VersaViT (Hugging Face model · 22 downloads · 4 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)