TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes

Xingcheng Zhou; Konstantinos Larintzakis; Hao Guo; Walter Zimmer; Mingyu Liu; Hu Cao; Jiajie Zhang; Venkatnarayanan Lakshminarasimhan; Leah Strand; Alois C. Knoll

arXiv:2502.02449 · cs.CV · February 5, 2025

TL;DR

TUMTraffic-VideoQA introduces a comprehensive dataset and benchmark for evaluating spatio-temporal reasoning in traffic scene videos, integrating multiple tasks to advance intelligent transportation research.

Contribution

The paper provides a new large-scale dataset with unified tasks and a baseline model, addressing the complexity of real-world traffic scene understanding.

Findings

01

The dataset contains 1,000 videos covering diverse real-world conditions such as adverse weather and traffic anomalies.

02

Existing models show limitations on this challenging benchmark.

03

The TUMTraffic-Qwen baseline offers insights into fine-grained reasoning challenges.

Abstract

We present TUMTraffic-VideoQA, a novel dataset and benchmark designed for spatio-temporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning annotations, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatio-temporal object expressions, TUMTraffic-VideoQA unifies three essential tasks (multiple-choice video question answering, referred object captioning, and spatio-temporal object grounding) within a cohesive evaluation framework. We further introduce the TUMTraffic-Qwen baseline model, enhanced with visual token sampling strategies, providing valuable insights into the challenges of fine-grained spatio-temporal reasoning. Extensive experiments demonstrate the dataset's complexity,…
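
To make the idea of a tuple-based spatio-temporal object expression more concrete, the following is a minimal sketch only. The abstract does not specify the tuple layout, so the `(t, x, y)` fields, the normalization to [0, 1], the text format, and the two error functions are all illustrative assumptions, not the dataset's actual schema or evaluation metrics.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectExpression:
    """Hypothetical reference to one object at one moment in a video.

    Assumption: time and image coordinates are normalized to [0, 1].
    The real TUMTraffic-VideoQA tuple format may differ.
    """
    t: float  # normalized timestamp within the clip
    x: float  # normalized horizontal position
    y: float  # normalized vertical position

    def __post_init__(self) -> None:
        # Reject values outside the assumed normalized range.
        for name in ("t", "x", "y"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name}={value} is outside [0, 1]")

    def as_text(self) -> str:
        # Render the tuple as a short string that could be embedded in a
        # question, caption, or grounding answer (format is an assumption).
        return f"<{self.t:.2f}, {self.x:.2f}, {self.y:.2f}>"


def spatial_error(pred: ObjectExpression, target: ObjectExpression) -> float:
    """Euclidean distance between predicted and target positions."""
    return math.hypot(pred.x - target.x, pred.y - target.y)


def temporal_error(pred: ObjectExpression, target: ObjectExpression) -> float:
    """Absolute difference between predicted and target timestamps."""
    return abs(pred.t - target.t)
```

Under these assumptions, one shared representation can serve all three tasks: a question or caption request can reference an object through `as_text()`, and a grounding prediction can be compared against an annotation with `spatial_error` and `temporal_error`.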

Code & Models

Repositories

TraffiX-VideoQA/TUMTraffic-VideoQA-Baseline

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Anomaly Detection Techniques and Applications