MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

Yangzhou Liu; Yue Cao; Zhangwei Gao; Weiyun Wang; Zhe Chen; Wenhai Wang; Hao Tian; Lewei Lu; Xizhou Zhu; Tong Lu; Yu Qiao; Jifeng Dai

arXiv:2407.15838·cs.CV·December 17, 2024

TL;DR

This paper introduces MMInstruct, a large, diverse, high-quality visual instruction dataset created using semi-automatic generation with GPT models, significantly improving vision-language model performance across multiple benchmarks.

Contribution

The paper presents MMInstruct, a multi-domain visual instruction dataset built with an efficient, semi-automatic generation method that leverages GPT models, improving both downstream model performance and the diversity of instruction data.

Findings

1. Model fine-tuning on MMInstruct achieves state-of-the-art results on 10 out of 12 benchmarks.
2. The instruction generation process is low-cost and scalable, reducing manual effort.
3. MMInstruct significantly improves the diversity and quality of visual instruction data.

Abstract

Vision-language supervised fine-tuning is effective in enhancing the performance of Vision Large Language Models (VLLMs). However, existing visual instruction tuning datasets have the following limitations: (1) Instruction annotation quality: although existing VLLMs exhibit strong performance, instructions generated by those advanced VLLMs may still suffer from inaccuracies, such as hallucinations. (2) Instruction and image diversity: the limited range of instruction types and the lack of diversity in image data may impair the model's ability to generate diverse outputs that are closer to real-world scenarios. To address these challenges, we construct a high-quality, diverse visual instruction tuning dataset, MMInstruct, which consists of 973K instructions from 24 domains. There are four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering…

Code & Models

Repositories

yuecao0119/mminstruct
PyTorch · Official

Datasets

yuecao0119/MMInstruct-GPT4V
dataset · 578 dl

Taxonomy

Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Residual Connection · Dropout · Adam · Byte Pair Encoding · Layer Normalization · Linear Layer