VNU-Bench: A Benchmarking Dataset for Multi-Source Multimodal News Video Understanding
Zibo Liu, Muyang Li, Zhe Jiang, Shigang Chen

TL;DR
VNU-Bench is a new benchmark dataset designed to evaluate multimodal large language models on their ability to understand and compare multi-source news videos, addressing a gap in existing single-source focused benchmarks.
Contribution
The paper introduces the first multi-source, cross-video news understanding benchmark, built with a novel QA generation process over a comprehensive dataset of news videos and questions.
Findings
Current MLLMs struggle with the multi-source, cross-video questions in VNU-Bench.
The dataset includes 429 news groups and 2,501 questions.
VNU-Bench reveals gaps in the multi-source multimodal understanding of current MLLMs.
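As a rough illustration of the scale reported above, the sketch below models a news group as a set of videos from different outlets plus questions that may draw on more than one video. The class names, field names, and sample values are hypothetical and are not taken from the dataset's actual schema; only the two counts (429 groups, 2,501 questions) come from the summary.

```python
from dataclasses import dataclass, field

# Hypothetical record layout; the real VNU-Bench schema may differ.
@dataclass
class NewsVideo:
    outlet: str    # reporting source (assumed field)
    video_id: str  # identifier within the group (assumed field)

@dataclass
class Question:
    text: str
    source_video_ids: list[str]  # videos needed to answer (assumed field)

@dataclass
class NewsGroup:
    event: str
    videos: list[NewsVideo] = field(default_factory=list)
    questions: list[Question] = field(default_factory=list)

def is_cross_video(question: Question) -> bool:
    """True when a question depends on more than one distinct video."""
    return len(set(question.source_video_ids)) > 1

# Counts reported in the summary.
NUM_GROUPS = 429
NUM_QUESTIONS = 2501
avg_questions_per_group = NUM_QUESTIONS / NUM_GROUPS

# Made-up example group, for illustration only.
group = NewsGroup(
    event="example event",
    videos=[NewsVideo("Outlet A", "v1"), NewsVideo("Outlet B", "v2")],
    questions=[Question("Which detail appears in only one report?", ["v1", "v2"])],
)

print(f"Average questions per group: {avg_questions_per_group:.2f}")
print("Cross-video question:", is_cross_video(group.questions[0]))
```

The average works out to a little under six questions per news group.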
Abstract
News videos are carefully edited multimodal narratives that combine narration, visuals, and external quotations into coherent storylines. In recent years, there have been significant advances in evaluating multimodal large language models (MLLMs) for news video understanding. However, existing benchmarks largely focus on single-source, intra-video reasoning, where each report is processed in isolation. In contrast, real-world news consumption is inherently multi-sourced: the same event is reported by different outlets with complementary details, distinct narrative choices, and sometimes conflicting claims that unfold over time. Robust news understanding, therefore, requires models to compare perspectives from different sources, align multimodal evidence across sources, and synthesize multi-source information. To fill this gap, we introduce VNU-Bench, the first benchmark for…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Misinformation and Its Impacts
