MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, Ziwei Liu

TL;DR
MIMIC-IT introduces a large, diverse dataset of 2.8 million multimodal instruction-response pairs with multi-modal context, enabling improved training and evaluation of vision-language models like Otter for perception, reasoning, and planning tasks.
Contribution
The paper presents MIMIC-IT, a novel large-scale multimodal instruction-response dataset with multi-modal context, and demonstrates its effectiveness by training Otter, a VLM that excels in perception and reasoning.
Findings
Otter shows strong performance on vision-language benchmarks.
MIMIC-IT dataset enhances multi-modal perception and reasoning.
Human evaluation confirms alignment with user intentions.
Abstract
High-quality instructions and responses are essential for the zero-shot performance of large language models on interactive natural language tasks. For interactive vision-language tasks involving intricate visual scenes, a large quantity of diverse and creative instruction-response pairs should be imperative to tune vision-language models (VLMs). Nevertheless, the current availability of vision-language instruction-response pairs in terms of quantity, diversity, and creativity remains limited, posing challenges to the generalization of interactive VLMs. Here we present MultI-Modal In-Context Instruction Tuning (MIMIC-IT), a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos. Each pair is accompanied by multi-modal in-context information, forming conversational contexts aimed at empowering VLMs in…
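The abstract describes each instruction-response pair as being accompanied by multi-modal in-context information. As a rough illustration of that idea, here is a minimal Python sketch of how such a record could be represented and flattened into a conversational prompt. The class names, field names, file names, and prompt layout (`InstructionResponse`, `InContextSample`, `build_prompt`, the `User:`/`Assistant:` markers) are assumptions made for this example; they are not the dataset's actual schema or Otter's actual prompt format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InstructionResponse:
    """One instruction-response pair tied to some visual inputs (hypothetical schema)."""
    instruction: str
    response: str
    media: List[str] = field(default_factory=list)  # image or video identifiers


@dataclass
class InContextSample:
    """A query pair plus the in-context example pairs that precede it."""
    query: InstructionResponse
    context: List[InstructionResponse] = field(default_factory=list)


def build_prompt(sample: InContextSample) -> str:
    """Flatten context examples and the query into one conversational prompt.

    The layout is illustrative only: context pairs appear in full, while the
    query's response is left open for the model to generate.
    """
    lines = []
    for example in sample.context:
        lines.append(f"[media: {', '.join(example.media)}]")
        lines.append(f"User: {example.instruction}")
        lines.append(f"Assistant: {example.response}")
    lines.append(f"[media: {', '.join(sample.query.media)}]")
    lines.append(f"User: {sample.query.instruction}")
    lines.append("Assistant:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical file names and texts, for illustration only.
    sample = InContextSample(
        query=InstructionResponse(
            instruction="What is the person likely to do next?",
            response="They will probably pour the water into the kettle.",
            media=["kitchen_clip.mp4"],
        ),
        context=[
            InstructionResponse(
                instruction="What is happening in this scene?",
                response="A person is filling a glass with water.",
                media=["sink_frame.jpg"],
            )
        ],
    )
    print(build_prompt(sample))
```

Keeping the context pairs separate from the query makes it easy to vary how many in-context examples a model sees without changing the record itself.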
Code & Models
- 🤗 HuggingFaceM4/idefics-80b (model) · 331 dl · ♡ 69
- 🤗 HuggingFaceM4/idefics-9b (model) · 1.9k dl · ♡ 47
- 🤗 HuggingFaceM4/idefics-9b-instruct (model) · 1.2k dl · ♡ 107
- 🤗 HuggingFaceM4/idefics-80b-instruct (model) · 5.3k dl · ♡ 189
- 🤗 sugiv/Spoonbill-Llama2OtterFlamingoAreFriends-7B-Chat (model)
- 🤗 sugiv/Spoonbill-GarudaOtterFlamingoAreFriends-7B-Chat (model)
- 🤗 areegtarek/idefics-9b-instruct-all (model) · 12 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Subtitles and Audiovisual Media
