VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Jiannan Wu; Muyan Zhong; Sen Xing; Zeqiang Lai; Zhaoyang Liu; Zhe Chen; Wenhai Wang; Xizhou Zhu; Lewei Lu; Tong Lu; Ping Luo; Yu Qiao; Jifeng Dai

arXiv:2406.08394·cs.CV·January 3, 2025·5 cites

TL;DR

VisionLLM v2 is a comprehensive multimodal large language model that unifies visual perception, understanding, and generation, enabling it to perform a wide range of vision-language tasks with a single, end-to-end trained framework.

Contribution

It introduces the 'super link' mechanism for flexible task information transmission and demonstrates end-to-end training on hundreds of diverse vision-language tasks.

Findings

1. Achieves performance comparable to task-specific models across many tasks.
2. Supports diverse vision tasks including VQA, object localization, pose estimation, and image editing.
3. Effectively resolves multi-task training conflicts.

Abstract

We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. To this end, we propose a new information transmission mechanism termed "super link", as a medium to connect MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios. In addition, to support the diverse range of tasks, we carefully…
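
The abstract describes the "super link" only as a medium that carries task information from the MLLM to task-specific decoders. The sketch below is one possible reading of that idea, not the authors' implementation: it assumes the MLLM emits a special routing token (the name `[DET]` is hypothetical), that a fixed number of query positions follow it (`num_queries` is also an assumption), and that the hidden states at those positions are handed to the matching decoder. It shows only the forward routing step; the gradient feedback mentioned in the abstract is not modeled.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
# A decoder consumes the hidden states of the query positions and returns a result.
Decoder = Callable[[Sequence[Vector]], str]


def dispatch_super_links(
    tokens: Sequence[str],
    hidden_states: Sequence[Vector],
    decoders: Dict[str, Decoder],
    num_queries: int = 2,  # assumed number of query slots after a routing token
) -> List[Tuple[str, str]]:
    """Route query hidden states to the decoder named by each routing token."""
    if len(tokens) != len(hidden_states):
        raise ValueError("tokens and hidden_states must have the same length")
    results: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        decoder = decoders.get(tokens[i])
        if decoder is None:
            i += 1  # ordinary text token: nothing to route
            continue
        # Hidden states of the query slots that follow the routing token.
        queries = hidden_states[i + 1 : i + 1 + num_queries]
        results.append((tokens[i], decoder(queries)))
        i += 1 + num_queries  # skip past the query slots
    return results


# Hypothetical toy data: one hidden value per token position.
demo_tokens = ["find", "[DET]", "<q>", "<q>", "done"]
demo_hidden = [[0.0], [1.0], [2.0], [3.0], [4.0]]
demo_decoders = {"[DET]": lambda q: f"det:{sum(v[0] for v in q)}"}
demo = dispatch_super_links(demo_tokens, demo_hidden, demo_decoders)
```

In this toy run, `demo` holds a single entry for `[DET]` built from the two query positions that follow it. Adding another task under this reading would mean registering one more routing token and decoder in the dictionary, while the text path stays unchanged.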

Code & Models

Repositories

opengvlab/visionllm (jax, official)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training