ChineseVideoBench: Benchmarking Multi-modal Large Models for Chinese Video Question Answering
Yuxiang Nie, Han Wang, Yongjie Ye, Haiyang Yu, Weitao Jia, Tao Zeng, Hao Feng, Xiang Fei, Yang Li, Xiaohui Lv, Guozhi Tang, Jingqun Tang, Jinghui Lu, Zehui Dai, Jiacong Wang, Dingkang Yang, An-Lan Wang, Can Huang

TL;DR
ChineseVideoBench is a new benchmark designed to evaluate multimodal large language models on Chinese video question answering, emphasizing cultural and linguistic understanding with comprehensive datasets and metrics.
Contribution
It introduces a specialized benchmark with a detailed dataset and evaluation framework for assessing MLLMs on Chinese video content, addressing a significant evaluation gap.
Findings
ChineseVideoBench is challenging for current MLLMs.
Gemini 2.5 Pro achieves the highest score of 77.9%.
InternVL-38B is the most competitive open-source model.
Abstract
This paper introduces ChineseVideoBench, a pioneering benchmark specifically designed for evaluating Multimodal Large Language Models (MLLMs) in Chinese Video Question Answering. The growing demand for sophisticated video analysis capabilities highlights the critical need for comprehensive, culturally-aware evaluation frameworks. ChineseVideoBench addresses this gap by providing a robust dataset and tailored evaluation metrics, enabling rigorous assessment of state-of-the-art MLLMs on complex Chinese video content. Specifically, ChineseVideoBench comprises 8 main classes and 12 sub-classes, encompassing tasks that demand both deep video understanding and nuanced Chinese linguistic and cultural awareness. Our empirical evaluations reveal that ChineseVideoBench presents a significant challenge to current MLLMs. Among the models assessed, Gemini 2.5 Pro achieves the highest performance…
Peer Reviews
Decision · Submitted to ICLR 2026
- It is a useful resource for evaluating multimodal LLMs in a non-English language.
- Human annotators annotated, verified, and validated the whole dataset, which helps ensure an unbiased and potentially correct benchmark.
- The benchmark covers a sufficient number of tasks and sub-categories, showing diversity of topics.
- The paper presents an evaluation of several MLLMs, including open- and closed-source models, in comparison with human performa
- Although human annotators carried out the annotation, the paper does not quantify the annotators' understanding or knowledge with any measure, e.g., inter-annotator agreement (IAA).
- No error analysis is presented for the LLM evaluation, for example, which types of tasks or questions are difficult for the models to answer.
- The videos were collected from a CC0-licensed platform, so it is quite possible that the LLMs evaluated in the paper have already seen
The paper introduces the first large-scale benchmark for Chinese VideoQA, covering 1,625 CC0-licensed videos and 6,507 manually annotated QA pairs. The benchmark exposes systematic failure patterns in temporal localization and fine-grained spatiotemporal grounding.
The paper does not report inter-annotator consistency, distractor calibration, or difficulty-level validation, leaving the annotation quality claims insufficiently quantified. The paper emphasizes long-video evaluation but does not provide convincing evidence that its videos require long-term temporal reasoning; average durations appear modest, weakening this claim.
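For readers unfamiliar with the inter-annotator consistency the reviewers ask for, one common measure is Cohen's kappa. The following is a minimal sketch, assuming two annotators have each answered the same multiple-choice questions; the function name and the sample labels are hypothetical and are not taken from the paper, which reports no such figure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Hypothetical helper for illustration; not part of ChineseVideoBench.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set")

    n = len(labels_a)
    # Fraction of items on which the two annotators chose the same option.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up answers (options A-D) from two annotators on six questions.
annotator_1 = ["A", "B", "C", "D", "A", "B"]
annotator_2 = ["A", "B", "C", "D", "A", "C"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A kappa near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance.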
This paper makes valuable contributions through its novel ChineseVideoBench dataset and rigorous evaluation. The benchmark features carefully curated videos and high-quality human annotations across diverse domains. A key strength is the comprehensive evaluation of leading MLLMs, revealing significant performance gaps between models and human capability, particularly in temporal understanding. The work provides important insights into Chinese video understanding and offers a solid foundation for
1. The dataset scale (1,625 videos, 6,507 QA pairs) is modest compared to major English benchmarks, and task distribution is uneven: some categories like "World Knowledge" contain under 100 questions, potentially affecting evaluation reliability despite claims of balance.
2. The benchmark's scope is constrained by design: audio tracks are removed and only multiple-choice format is supported. While this simplifies evaluation, it limits real-world applicability and excludes generative QA formats
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
