MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang

TL;DR
The paper introduces MMDU, a comprehensive benchmark and dataset for evaluating and enhancing multi-turn, multi-image vision-language models, addressing limitations of existing benchmarks and demonstrating the benefits of instruction tuning.
Contribution
It presents MMDU and MMDU-45k, the first large-scale benchmark and dataset for multi-turn, multi-image LVLMs, and shows that fine-tuning on this data improves model performance.
Findings
Open-source LVLMs lag behind closed-source models in multi-turn, multi-image tasks.
Fine-tuning on MMDU-45k leads to longer and more accurate conversations.
Models show increased scores on MMDU and other benchmarks after fine-tuning.
Abstract
Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history with multi-turn and multi-images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ the clustering algorithm to find the relevant…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
