Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Ziang Yan; Zhilin Li; Yinan He; Chenting Wang; Kunchang Li; Xinhao Li; Xiangyu Zeng; Zilei Wang; Yali Wang; Yu Qiao; Limin Wang; Yi Wang

arXiv:2412.19326·cs.CV·July 1, 2025

TL;DR

Task Preference Optimization (TPO) enhances multimodal large language models by learning task-specific preferences, significantly improving performance and zero-shot capabilities across visual tasks through multi-task co-training.

Contribution

The paper introduces TPO, a scalable method with learnable task tokens that improves multimodal models' performance by leveraging rich visual labels and multi-task co-training.

Findings

01 14.6% overall performance improvement

02 Robust zero-shot capabilities across tasks

03 Synergistic benefits from multi-task co-training

Abstract

Current multimodal large language models (MLLMs) struggle with fine-grained or precise visual understanding, even though they provide comprehensive perception and reasoning across a spectrum of vision applications. Recent studies either develop tool use or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. To address this issue and enhance MLLMs with visual tasks in a scalable fashion, we propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. TPO introduces learnable task tokens that establish connections between multiple task-specific heads and the MLLM. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Through multi-task co-training…
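
The idea of learnable task tokens linking task-specific heads to the model can be illustrated with a toy sketch. This is not the authors' implementation: the task names, dimensions, `toy_backbone` stand-in, linear heads, and squared-error loss are all hypothetical choices made only to show how one token per task could route a hidden state to its own head, and how per-task losses could be summed for co-training.

```python
import random

random.seed(0)
DIM = 4  # hypothetical embedding size


def rand_vec(n):
    return [random.uniform(-0.1, 0.1) for _ in range(n)]


# Hypothetical task names; one learnable token embedding per task.
TASKS = ["grounding", "tracking"]
task_tokens = {t: rand_vec(DIM) for t in TASKS}

# Hypothetical linear heads: each maps a hidden vector to a small output.
head_out_dim = {"grounding": 2, "tracking": 4}
heads = {t: [rand_vec(DIM) for _ in range(head_out_dim[t])] for t in TASKS}


def toy_backbone(seq):
    """Stand-in for the MLLM: mixes each position with the sequence mean."""
    mean = [sum(col) / len(seq) for col in zip(*seq)]
    return [[0.5 * x + 0.5 * m for x, m in zip(vec, mean)] for vec in seq]


def run_heads(input_embeds):
    # Append the task tokens after the ordinary input embeddings.
    seq = input_embeds + [task_tokens[t] for t in TASKS]
    hidden = toy_backbone(seq)
    outputs = {}
    for i, t in enumerate(TASKS):
        # Hidden state at this task's token position feeds this task's head.
        h = hidden[len(input_embeds) + i]
        outputs[t] = [sum(w * x for w, x in zip(row, h)) for row in heads[t]]
    return outputs


def multitask_loss(outputs, targets):
    # Sum of per-task squared errors; a real system would use task-specific
    # losses and backpropagate through heads, tokens, and the backbone.
    return sum(
        sum((o - y) ** 2 for o, y in zip(outputs[t], targets[t]))
        for t in targets
    )


inputs = [rand_vec(DIM) for _ in range(3)]  # three hypothetical input embeddings
outs = run_heads(inputs)
targets = {"grounding": [0.0, 1.0], "tracking": [0.0, 0.0, 1.0, 1.0]}
loss = multitask_loss(outs, targets)
```

In this sketch the only coupling between a head and the backbone is the hidden state at that head's token, so gradients from every task's labels would reach the shared model through the tokens. That is one plausible reading of "differentiable task preferences"; the paper itself should be consulted for the actual design.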

Code & Models

Repositories

opengvlab/tpo (PyTorch, official)

Models

🤗 OpenGVLab/VideoChat-TPO (model, 73 downloads, 5 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications