UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics

Joseph Raj Vishal; Nagasiri Poluri; Katha Naik; Rutuja Patil; Kashyap Hegde Kota; Krishna Vinod; Prithvi Jai Ramesh; Mohammad Farhadi; Yezhou Yang; Bharatesh Chakravarthi

arXiv:2602.21137 · cs.CV · February 25, 2026

Open Access

TL;DR

UDVideoQA introduces a comprehensive traffic video question answering dataset capturing real urban scenes, enabling systematic evaluation of visual grounding and causal reasoning in dynamic traffic environments.

Contribution

The paper presents UDVideoQA, a large-scale, privacy-preserving dataset with hierarchical reasoning annotations, and benchmarks multiple models to evaluate their reasoning capabilities in urban traffic videos.

Findings

01. Models show a perception-reasoning gap and struggle with visual grounding.

02. Fine-tuning raises model performance to near that of proprietary models.

03. Models generate relevant questions, but the questions lack linguistic diversity.

Abstract

Understanding the complex, multi-agent dynamics of urban traffic remains a fundamental challenge for video language models. This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. It employs an event-driven dynamic blur technique to ensure privacy preservation without compromising scene fidelity. Using a unified annotation pipeline, the dataset contains 28K question-answer pairs generated across 8 hours of densely annotated video, averaging one question per second. Its taxonomy follows a hierarchical reasoning level, spanning basic understanding and attribution to event reasoning, reverse reasoning, and counterfactual inference, enabling systematic…
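The "one question per second" figure can be checked against the other numbers quoted in the abstract. The sketch below is only an arithmetic sanity check: it assumes "28K" means exactly 28,000 pairs (the abstract gives a rounded figure), and the function name is illustrative rather than anything from the dataset's tooling.

```python
# Figures quoted in the abstract. "28K" is treated as exactly 28,000,
# which is an assumption; the abstract only gives a rounded count.
QA_PAIRS = 28_000
ANNOTATED_HOURS = 8


def questions_per_second(qa_pairs: int, hours: float) -> float:
    """Average number of question-answer pairs per second of annotated video."""
    return qa_pairs / (hours * 3600)


rate = questions_per_second(QA_PAIRS, ANNOTATED_HOURS)
print(f"{rate:.3f} questions per second")  # prints "0.972 questions per second"
```

Eight hours is 28,800 seconds, so roughly 28,000 pairs works out to just under one question per second, consistent with the abstract's rounded claim.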

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis