MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

Tao Gong; Chengqi Lyu; Shilong Zhang; Yudong Wang; Miao Zheng; Qian Zhao; Kuikun Liu; Wenwei Zhang; Ping Luo; Kai Chen

arXiv:2305.04790 · cs.CV · June 14, 2023 · 65 cites

TL;DR

MultiModal-GPT is a parameter-efficient vision and language model designed for multi-round human dialogue, capable of understanding and following diverse instructions through multi-modality instruction tuning and joint training.

Contribution

It introduces a multi-modal instruction tuning approach with vision and language data, and demonstrates improved dialogue performance via joint training with language-only data.
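As a rough illustration of how one template could serve both vision-language and language-only samples during joint training, the sketch below builds a prompt with an optional image slot. The section headers, the `<image>` placeholder, and the `build_prompt` helper are assumptions made for this example; the paper's actual template wording may differ.

```python
# Hypothetical prompt layout; the field names and the "<image>" token
# are illustrative assumptions, not the paper's confirmed template.
IMAGE_LINE = "### Image: <image>\n"
BODY = "### Instruction: {instruction}\n### Response: {response}"


def build_prompt(instruction, response="", has_image=True):
    """Format one training sample.

    Vision-language samples get an image slot; language-only samples
    reuse the same instruction/response layout without it, so both
    kinds of data can be mixed in one training run.
    """
    prefix = IMAGE_LINE if has_image else ""
    return prefix + BODY.format(instruction=instruction, response=response)


vision_sample = build_prompt("Count the dogs.", "Two.", has_image=True)
text_sample = build_prompt("Name a prime number.", "Seven.", has_image=False)
```

Keeping the instruction and response fields identical across both data types is one simple way to let language-only dialogue data share the format of the vision-language data.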

Findings

01. Effective multi-round dialogue with humans demonstrated

02. Joint training with language-only data enhances dialogue quality

03. Model responds accurately to diverse vision-language instructions

Abstract

We present a vision and language model named MultiModal-GPT to conduct multi-round dialogue with humans. MultiModal-GPT can follow various instructions from humans, such as generating a detailed caption, counting the number of objects of interest, and answering general questions from users. MultiModal-GPT is parameter-efficiently fine-tuned from OpenFlamingo, with Low-rank Adapters (LoRA) added to both the cross-attention and the self-attention parts of the language model. We first construct instruction templates with vision and language data for multi-modality instruction tuning, so that the model understands and follows human instructions. We find that the quality of training data is vital for dialogue performance: even a small amount of data containing short answers can lead the model to give short responses to any instruction. To further enhance MultiModal-GPT's ability to chat with humans,…
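The low-rank adaptation idea mentioned in the abstract can be sketched in a few lines. This is a minimal, dependency-free illustration of a generic LoRA layer, not the authors' implementation: the class name `LoRALinear`, the `rank` and `alpha` parameters, and the initialization constants are assumptions chosen for the example. A frozen weight matrix W is augmented with a trainable update (alpha / rank) * B * A, where B starts at zero so the adapted layer initially matches the base layer.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear map plus a trainable low-rank update (illustrative)."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        self.weight = weight  # frozen base weight, shape (out_dim, in_dim)
        out_dim, in_dim = len(weight), len(weight[0])
        self.scale = alpha / rank
        # A is small random, shape (rank, in_dim); B is zero, shape (out_dim, rank).
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A and B would be trained; the base weight stays frozen.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]], rank=1, alpha=2)
initial = layer.forward([3.0, 4.0])  # equals the base output while B is zero

# Simulate a training step by setting the adapter weights by hand.
layer.A = [[1.0, 1.0]]
layer.B = [[1.0], [0.0]]
adapted = layer.forward([3.0, 4.0])
```

For a layer with input and output width d, the adapter adds 2 * d * rank trainable values against d * d frozen ones, which is where the parameter efficiency comes from when the rank is much smaller than d.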

Code & Models

Repositories

open-mmlab/multimodal-gpt (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Adapter