MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Yaning Pan; Qianqian Xie; Guohui Zhang; Zekun Wang; Yongqian Wen; Yuanxing Zhang; Haoxuan Hu; Zhiyu Pan; Yibing Huang; Zhidong Gan; Yonghong Lin; An Ping; Shihao Li; Yanghai Wang; Tianhao Peng; Jiaheng Liu

arXiv:2510.17722 · cs.CV · January 9, 2026

Open Access · 1 dataset

TL;DR

MT-Video-Bench is a comprehensive benchmark for evaluating multimodal large language models in multi-turn video dialogues, addressing a gap left by existing evaluation methods that focus on single-turn question answering.

Contribution

Introduces MT-Video-Bench, a new holistic benchmark with 1,000 multi-turn dialogues to assess MLLMs' capabilities in complex video understanding tasks.

Findings

01 Current MLLMs differ significantly in performance on the benchmark.

02 Current MLLMs show limitations in handling multi-turn video dialogues.

03 The benchmark is intended to promote future research in multimodal dialogue systems.

Abstract

The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses 6 core competencies that focus on perceptivity and interactivity, encompassing 1,000 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source…

Code & Models

Datasets

NJU-LINK/MT-Video-Bench
dataset · 55 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling