TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes
Xingcheng Zhou, Konstantinos Larintzakis, Hao Guo, Walter Zimmer, Mingyu Liu, Hu Cao, Jiajie Zhang, Venkatnarayanan Lakshminarasimhan, Leah Strand, Alois C. Knoll

TL;DR
TUMTraffic-VideoQA introduces a comprehensive dataset and benchmark for evaluating spatio-temporal reasoning in traffic scene videos, integrating multiple tasks to advance intelligent transportation research.
Contribution
It provides a new large-scale dataset with unified tasks and a baseline model, addressing the complexity of real-world traffic scene understanding.
Findings
The dataset contains 1,000 videos with diverse conditions.
Existing models show limitations on this challenging benchmark.
The TUMTraffic-Qwen baseline offers insights into fine-grained reasoning challenges.
Abstract
We present TUMTraffic-VideoQA, a novel dataset and benchmark designed for spatio-temporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatio-temporal object expressions, TUMTraffic-VideoQA unifies three essential tasks (multiple-choice video question answering, referred object captioning, and spatio-temporal object grounding) within a cohesive evaluation framework. We further introduce the TUMTraffic-Qwen baseline model, enhanced with visual token sampling strategies, providing valuable insights into the challenges of fine-grained spatio-temporal reasoning. Extensive experiments demonstrate the dataset's complexity,…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Anomaly Detection Techniques and Applications
