MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
Jihao Liu, Xin Huang, Jinliang Zheng, Boxiao Liu, Jia Wang, Osamu Yoshie, Yu Liu, and Hongsheng Li

TL;DR
This paper presents MM-Instruct, a large-scale dataset of LLM-generated visual instructions for improving the instruction-following ability of large multimodal models, together with an evaluation benchmark and a model trained on the data that shows improved performance.
Contribution
It introduces a method that uses LLMs to generate diverse visual instruction data from existing image captioning datasets, broadening what multimodal models can be trained to do beyond question-answering.
Findings
LLaVA-Instruct, a LLaVA-1.5 model trained with MM-Instruct data, outperforms the LLaVA-1.5 baseline in instruction-following tasks.
The generated dataset improves model generalization to diverse visual instructions.
The accompanying benchmark provides a new standard for evaluating multimodal instruction-following.
Abstract
This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question-answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmentation and summarization. It then matches these instructions with images and uses an open-sourced…
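The first two stages named in the abstract (expanding a small seed set, then matching instructions to images) can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: `augment_instructions` is a hypothetical stub standing in for the ChatGPT call, and the matching step uses plain word overlap against captions as a swapped-in technique, since the abstract excerpt does not say how matching is done. All function and class names are made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class InstructionSample:
    image_id: str
    caption: str
    instruction: str


def augment_instructions(seeds):
    """Hypothetical stand-in for the LLM augmentation step.

    The source uses ChatGPT here; this stub only applies fixed rewrites
    so that the example runs without any external service.
    """
    prefixes = ["", "Based on the image, ", "In two sentences, "]
    augmented = []
    for seed in seeds:
        if not seed:
            continue  # skip empty seeds
        for prefix in prefixes:
            # Lowercase the first letter when the seed follows a prefix.
            body = seed if not prefix else seed[0].lower() + seed[1:]
            candidate = prefix + body
            if candidate not in augmented:
                augmented.append(candidate)
    return augmented


def tokenize(text):
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap_score(instruction, caption):
    """Jaccard word overlap (an assumption, not the paper's matcher)."""
    a, b = tokenize(instruction), tokenize(caption)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_instructions(captions, instructions):
    """Pair each captioned image with its highest-scoring instruction."""
    samples = []
    for image_id, caption in captions.items():
        best = max(instructions, key=lambda ins: overlap_score(ins, caption))
        samples.append(InstructionSample(image_id, caption, best))
    return samples


if __name__ == "__main__":
    seeds = ["Write a recipe for the dish shown.",
             "Write a slogan for the poster."]
    captions = {"img1": "a recipe card next to a bowl of pasta",
                "img2": "a poster advertising a jazz concert"}
    for sample in match_instructions(captions, augment_instructions(seeds)):
        print(sample.image_id, "->", sample.instruction)
```

Word overlap is used only to keep the sketch dependency-free; a real pipeline would likely substitute a learned similarity measure at `overlap_score` without changing the surrounding structure.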
Taxonomy
Topics: Speech and dialogue systems · Semantic Web and Ontologies · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · Focus
