MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment

Jihao Liu; Xin Huang; Jinliang Zheng; Boxiao Liu; Jia Wang; Osamu Yoshie; Yu Liu; Hongsheng Li

arXiv:2406.19736·cs.CV·July 1, 2024·1 cites

Open Access · 2 Repos · 1 Dataset

TL;DR

This paper presents MM-Instruct, a large-scale dataset of visual instructions generated using LLMs to improve multimodal model instruction-following, along with a benchmark and improved model performance.

Contribution

It introduces a novel method to generate diverse visual instruction data using LLMs from existing datasets, enhancing multimodal model capabilities.

Findings

01

LLaVA-Instruct, a LLaVA-1.5 model instruction-tuned on the generated data, outperforms the LLaVA-1.5 baseline in instruction-following tasks.

02

Generated dataset improves model generalization to diverse visual instructions.

03

Benchmark provides a new standard for evaluating multimodal instruction-following.

Abstract

This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question-answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmentation and summarization. It then matches these instructions with images and uses an open-sourced…
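The instruction-to-image matching step mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's method: the abstract excerpt does not say how matching is done, so the example below assumes each image comes with a caption and scores instruction/caption pairs by simple word overlap (Jaccard similarity). The function names, the `top_k` parameter, and the sample data are all hypothetical.

```python
import re


def tokens(text):
    """Lowercase word set used for a crude overlap score."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_instructions(instructions, captions, top_k=1):
    """For each instruction, return the ids of the top_k best-overlapping captions.

    Assumption: word overlap stands in for whatever matching the paper uses.
    """
    matches = {}
    for inst in instructions:
        inst_tokens = tokens(inst)
        ranked = sorted(
            captions,
            key=lambda image_id: jaccard(inst_tokens, tokens(captions[image_id])),
            reverse=True,
        )
        matches[inst] = ranked[:top_k]
    return matches


# Hypothetical data for illustration only.
instructions = [
    "Write a short poem about the dog in the image.",
    "Summarize the food shown on the table.",
]
captions = {
    "img_001": "A brown dog runs across a grassy park.",
    "img_002": "A table covered with plates of food and fruit.",
}
matches = match_instructions(instructions, captions)
for inst, image_ids in matches.items():
    print(inst, "->", image_ids)
```

In practice a learned text or image embedding would likely replace the word-overlap score; the sketch only shows the shape of the step, pairing each generated instruction with the images it plausibly applies to.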

Code & Models

Repositories

Datasets

jjjjh/MM-Instruct
dataset · 88 downloads

Taxonomy

Topics: Speech and dialogue systems · Semantic Web and Ontologies · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · Focus