CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs

Xingcheng Zhou; Hao Guo; Rui Song; Walter Zimmer; Mingyu Liu; André Schamschurko; Hu Cao; Alois Knoll

arXiv:2604.20460 · cs.CV · April 23, 2026

TL;DR

CCTVBench is a new benchmark for traffic video question answering that emphasizes contrastive consistency, testing models' ability to detect true hazards and reject false hypotheses in near-identical scenes.

Contribution

The paper introduces CCTVBench, a benchmark with paired real and counterfactual videos, and proposes C-TCD, a contrastive decoding method to improve model consistency and accuracy.

Findings

01. Models show a large gap between QA accuracy and contrastive consistency.

02. Unreliable rejection of none-of-the-above answers is a key challenge.

03. C-TCD improves both QA accuracy and contrastive consistency.

Abstract

Safety-critical traffic reasoning requires contrastive consistency: models must detect true hazards when an accident occurs, and reliably reject plausible-but-false hypotheses under near-identical counterfactual scenes. We present CCTVBench, a Contrastive Consistency Traffic VideoQA Benchmark built on paired real accident videos and world-model-generated counterfactual counterparts, together with minimally different, mutually exclusive hypothesis questions. CCTVBench enforces a single structured decision pattern over each video question quadruple and provides actionable diagnostics that decompose failures into positive omission, positive swap, negative hallucination, and mutual-exclusivity violation, while separating video versus question consistency. Experiments across open-source and proprietary video LLMs reveal a large and persistent gap between standard per-instance QA metrics and…
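To make the gap between per-instance accuracy and quadruple-level consistency concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the benchmark's scoring code: it assumes each quadruple crosses two videos (real, counterfactual) with two mutually exclusive hypothesis questions (one true for the real video, one false), and that each model answer can be reduced to "affirmed" or "rejected". The names `QuadrupleAnswers`, `EXPECTED`, and `diagnose`, and the exact rule attached to each failure label, are hypothetical readings of the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuadrupleAnswers:
    """Hypothetical record: True means the model affirmed the hypothesis."""
    real_true: bool    # real video, hypothesis that actually holds
    real_false: bool   # real video, plausible-but-false hypothesis
    cf_true: bool      # counterfactual video, same "true" hypothesis
    cf_false: bool     # counterfactual video, same "false" hypothesis


# Assumed target pattern: affirm only the true hypothesis on the real video.
EXPECTED = QuadrupleAnswers(True, False, False, False)


def diagnose(a: QuadrupleAnswers) -> set:
    """Tag a quadruple with failure labels (assumed definitions)."""
    tags = set()
    if not a.real_true and not a.real_false:
        tags.add("positive_omission")             # real hazard missed entirely
    if not a.real_true and a.real_false:
        tags.add("positive_swap")                 # wrong hypothesis affirmed instead
    if a.cf_true or a.cf_false:
        tags.add("negative_hallucination")        # hazard affirmed where none occurs
    if (a.real_true and a.real_false) or (a.cf_true and a.cf_false):
        tags.add("mutual_exclusivity_violation")  # both exclusive hypotheses affirmed
    return tags


def per_instance_accuracy(quads) -> float:
    """Fraction of individual (video, question) cells answered as expected."""
    fields = ("real_true", "real_false", "cf_true", "cf_false")
    correct = sum(
        getattr(q, f) == getattr(EXPECTED, f) for q in quads for f in fields
    )
    return correct / (len(fields) * len(quads))


def quadruple_consistency(quads) -> float:
    """Fraction of quadruples where all four cells match the expected pattern."""
    return sum(q == EXPECTED for q in quads) / len(quads)


# Toy answers (made up for illustration, not benchmark data).
quads = [
    QuadrupleAnswers(True, False, False, False),  # fully consistent
    QuadrupleAnswers(True, False, True, False),   # hallucinates on counterfactual
    QuadrupleAnswers(False, True, False, False),  # swaps hypotheses on real video
    QuadrupleAnswers(True, True, False, False),   # affirms both on real video
]

print(per_instance_accuracy(quads))   # 12 of 16 cells correct
print(quadruple_consistency(quads))   # 1 of 4 quadruples fully correct
```

In this toy set, three quarters of the individual answers are correct, yet only one quadruple in four matches the full pattern, which is the kind of divergence the abstract describes between standard per-instance metrics and contrastive consistency.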
