CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs
Xingcheng Zhou, Hao Guo, Rui Song, Walter Zimmer, Mingyu Liu, André Schamschurko, Hu Cao, Alois Knoll

TL;DR
CCTVBench is a new benchmark for traffic video question answering that emphasizes contrastive consistency, testing models' ability to detect true hazards and reject false hypotheses in near-identical scenes.
Contribution
The paper introduces CCTVBench, a benchmark with paired real and counterfactual videos, and proposes C-TCD, a contrastive decoding method to improve model consistency and accuracy.
Findings
Open-source and proprietary video LLMs show a large, persistent gap between standard per-instance QA accuracy and contrastive consistency.
Unreliable none-of-the-above rejection of plausible-but-false hypotheses is a key challenge.
C-TCD improves both QA accuracy and contrastive consistency.
Abstract
Safety-critical traffic reasoning requires contrastive consistency: models must detect true hazards when an accident occurs, and reliably reject plausible-but-false hypotheses under near-identical counterfactual scenes. We present CCTVBench, a Contrastive Consistency Traffic VideoQA Benchmark built on paired real accident videos and world-model-generated counterfactual counterparts, together with minimally different, mutually exclusive hypothesis questions. CCTVBench enforces a single structured decision pattern over each video question quadruple and provides actionable diagnostics that decompose failures into positive omission, positive swap, negative hallucination, and mutual-exclusivity violation, while separating video versus question consistency. Experiments across open-source and proprietary video LLMs reveal a large and persistent gap between standard per-instance QA metrics and…
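The failure categories named in the abstract can be illustrated with a small sketch. The abstract does not define them formally, so everything here is an assumed reading rather than the benchmark's actual scoring code: the `Cell` structure, the `"none"` label for a none-of-the-above answer, the per-category rules, and the rule that a quadruple counts as consistent only when all four video-question cells are answered correctly are all hypothetical.

```python
from dataclasses import dataclass

NONE = "none"  # hypothetical label for a none-of-the-above answer


@dataclass(frozen=True)
class Cell:
    """One (video, question) pair of a quadruple; field names are assumptions."""
    video: str      # "real" or "counterfactual"
    question: str   # one of two mutually exclusive hypothesis questions
    expected: str   # ground-truth answer, or NONE if the hypothesis is false
    predicted: str  # model answer, or NONE if the model rejects the hypothesis


def classify(cell: Cell) -> str:
    """Assumed mapping from a single answer to a failure category."""
    if cell.predicted == cell.expected:
        return "correct"
    if cell.expected == NONE:
        return "negative_hallucination"  # affirmed a hazard that is not there
    if cell.predicted == NONE:
        return "positive_omission"       # missed a hazard that is there
    return "positive_swap"               # named the wrong hazard


def evaluate(quadruple: list[Cell]) -> dict:
    """Compare per-instance accuracy with all-or-nothing quadruple consistency."""
    labels = [classify(c) for c in quadruple]
    # Assumed mutual-exclusivity check: on one video, at most one of the
    # mutually exclusive hypotheses may be affirmed.
    affirmed: dict[str, int] = {}
    for c in quadruple:
        if c.predicted != NONE:
            affirmed[c.video] = affirmed.get(c.video, 0) + 1
    return {
        "accuracy": labels.count("correct") / len(labels),
        "consistent": all(label == "correct" for label in labels),
        "failures": [label for label in labels if label != "correct"],
        "mutex_violations": sorted(v for v, n in affirmed.items() if n > 1),
    }


# Made-up example: the model is right on the real video but hallucinates
# the same hazard on the counterfactual video.
quad = [
    Cell("real", "q1", "collision", "collision"),
    Cell("real", "q2", NONE, NONE),
    Cell("counterfactual", "q1", NONE, "collision"),
    Cell("counterfactual", "q2", NONE, NONE),
]
result = evaluate(quad)
```

In this made-up example three of the four answers are correct, so per-instance accuracy is 0.75, yet the quadruple is not consistent because of a single negative hallucination. Under the assumptions above, that is the kind of gap between per-instance metrics and consistency the abstract refers to.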
