MANTIS: Interleaved Multi-Image Instruction Tuning
Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen

TL;DR
This paper introduces Mantis, a multi-image instruction tuning approach that significantly improves multi-image visual language tasks with less data and training, achieving state-of-the-art results and strong generalization.
Contribution
Mantis demonstrates that effective multi-image abilities can be achieved through instruction tuning on a modest dataset, challenging the reliance on massive pre-training.
Findings
Mantis-Idefics2 achieves state-of-the-art on all multi-image benchmarks.
Mantis outperforms models pre-trained on far larger interleaved corpora, such as Idefics2-8B.
Mantis maintains strong single-image performance, showing versatility.
Abstract
Large multimodal models (LMMs) have shown great results in single-image vision language tasks. However, their ability to solve multi-image visual language tasks has yet to be improved. Existing LMMs like OpenFlamingo, Emu2, and Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text examples from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. To this end, we meticulously construct Mantis-Instruct, a dataset of 721K multi-image instruction examples, to train a family of Mantis models. The instruction tuning empowers Mantis with different multi-image skills like co-reference, comparison, reasoning, and temporal understanding. We evaluate Mantis on 8 multi-image benchmarks and 6 single-image benchmarks. Mantis-Idefics2 can…
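To make "interleaved multi-image instruction data" concrete, here is a minimal sketch of how one such training example might be represented and split into ordered text and image segments. The `<image>` placeholder, the `MultiImageSample` class, and the `split_interleaved` helper are all hypothetical names chosen for illustration; they are not taken from the paper or from the Mantis codebase, and the actual Mantis-Instruct format may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical placeholder marking where each image sits in the text.
IMAGE_TOKEN = "<image>"


@dataclass
class MultiImageSample:
    """One illustrative multi-image instruction example (assumed schema)."""
    images: List[str]   # image paths, in the order they appear in the text
    instruction: str    # text with one IMAGE_TOKEN per image
    response: str       # target answer

    def validate(self) -> None:
        # The number of placeholders must match the number of images.
        n_tokens = self.instruction.count(IMAGE_TOKEN)
        if n_tokens != len(self.images):
            raise ValueError(
                f"{n_tokens} placeholders but {len(self.images)} images"
            )


def split_interleaved(sample: MultiImageSample) -> List[Tuple[str, str]]:
    """Return ordered ("text", str) and ("image", path) segments."""
    sample.validate()
    parts = sample.instruction.split(IMAGE_TOKEN)
    segments: List[Tuple[str, str]] = []
    for i, text in enumerate(parts):
        if text.strip():
            segments.append(("text", text.strip()))
        # Every gap between two text parts corresponds to one image.
        if i < len(sample.images):
            segments.append(("image", sample.images[i]))
    return segments


sample = MultiImageSample(
    images=["a.jpg", "b.jpg"],
    instruction="Image 1: <image> Image 2: <image> Which image shows more people?",
    response="The second image.",
)
for kind, value in split_interleaved(sample):
    print(kind, value)
```

Keeping image positions explicit in the text is what lets a question refer back to "Image 1" or "Image 2", which is the kind of co-reference and comparison skill the abstract describes.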
Taxonomy
Topics: Advanced Vision and Imaging · Video Coding and Compression Technologies · Advanced Data Compression Techniques
