MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

Jun Chen; Deyao Zhu; Xiaoqian Shen; Xiang Li; Zechun Liu; Pengchuan Zhang; Raghuraman Krishnamoorthi; Vikas Chandra; Yunyang Xiong; Mohamed Elhoseiny

arXiv:2310.09478 · cs.CV · November 8, 2023 · 66 cites

TL;DR

MiniGPT-v2 is a unified vision-language model that uses task identifiers to perform multiple tasks, such as image captioning, visual question answering (VQA), and visual grounding, with a single model, achieving strong benchmark results.

Contribution

The paper introduces MiniGPT-v2, a multi-task vision-language model that employs unique task identifiers to improve task differentiation and learning efficiency.

Findings

01

Achieves strong performance on visual question-answering benchmarks.

02

Effectively handles multiple vision-language tasks with a single model.

03

Uses task identifiers to enhance task learning and differentiation.
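As a rough illustration of the task-identifier idea, the sketch below prepends a task-specific token to each instruction so that a single model can tell the tasks apart. The token strings, the `<image>` placeholder, and the `build_prompt` helper are all hypothetical names chosen for this example; they are assumptions, not the prompt template documented on this page.

```python
# Hypothetical task-identifier tokens. The exact strings are assumptions
# made for illustration only.
TASK_IDENTIFIERS = {
    "caption": "[caption]",
    "vqa": "[vqa]",
    "grounding": "[grounding]",
}


def build_prompt(task: str, instruction: str, image_placeholder: str = "<image>") -> str:
    """Return a multi-modal prompt with a task identifier before the instruction.

    The layout (image placeholder, then identifier, then instruction) is an
    assumed convention for this sketch.
    """
    if task not in TASK_IDENTIFIERS:
        raise ValueError(f"unknown task: {task!r}")
    return f"{image_placeholder} {TASK_IDENTIFIERS[task]} {instruction}"


if __name__ == "__main__":
    print(build_prompt("vqa", "What color is the car?"))
    print(build_prompt("caption", "Describe the image."))
```

With a layout like this, the same instruction text can be routed to different behaviors only by changing the identifier, which is one way to read the claim that identifiers reduce ambiguity between tasks.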

Abstract

Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model to perform diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to distinguish each task instruction easily and also improve the model's learning efficiency for each task. After the three-stage training, the experimental results…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling