MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Ziyu Liu; Tao Chu; Yuhang Zang; Xilin Wei; Xiaoyi Dong; Pan Zhang; Zijian Liang; Yuanjun Xiong; Yu Qiao; Dahua Lin; Jiaqi Wang

arXiv:2406.11833 · cs.CV · October 30, 2024 · 2 cites

TL;DR

The paper introduces MMDU, a comprehensive benchmark and dataset for evaluating and enhancing multi-turn, multi-image vision-language models, addressing limitations of existing benchmarks and demonstrating the benefits of instruction tuning.

Contribution

It presents MMDU and MMDU-45k, the first large-scale benchmark and dataset for multi-turn, multi-image LVLMs, and shows fine-tuning on this data improves model performance.

Findings

01 Open-source LVLMs lag behind closed-source models in multi-turn, multi-image tasks.

02 Fine-tuning on MMDU-45k significantly improves conversation length and accuracy.

03 Models show increased scores on MMDU and other benchmarks after fine-tuning.

Abstract

Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn, single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history with multiple turns and multiple images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction-tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ the clustering algorithm to find the relevant…

Code & Models

Repositories

liuziyu77/mmdu · PyTorch · Official

Datasets

Videos

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Focus