InterAct-Video: Reasoning-Rich Video QA for Urban Traffic

Joseph Raj Vishal; Divesh Basina; Rutuja Patil; Manas Srinivas Gowda; Katha Naik; Yezhou Yang; Bharatesh Chakravarthi

arXiv:2507.14743·cs.CV·August 29, 2025

Open Access

TL;DR

InterAct-Video introduces a new dataset and benchmark for VideoQA in traffic scenes, highlighting the need for specialized models to understand complex spatiotemporal traffic events.

Contribution

The paper presents the InterAct VideoQA dataset with real-world traffic videos and questions, enabling better evaluation and development of reasoning-rich VideoQA models for urban traffic monitoring.

Findings

01. State-of-the-art models struggle with complex traffic scenes.

02. Fine-tuning improves model performance significantly.

03. The dataset reveals key challenges in traffic scene understanding.

Abstract

Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces InterAct VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The InterAct VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and…
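To give a sense of scale for the figures quoted in the abstract, the segmentation arithmetic can be sketched as follows. This is only an illustration, not the authors' pipeline: the helper `clip_windows` is hypothetical, and it assumes non-overlapping clips, that all 8 hours of footage are kept, and that any trailing partial clip is dropped. The actual clip count in the dataset may differ.

```python
def clip_windows(total_seconds: int, clip_seconds: int = 10) -> list[tuple[int, int]]:
    """Return (start, end) second offsets for non-overlapping fixed-length clips.

    Hypothetical helper for illustration; a trailing partial clip is dropped.
    """
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    last_start = total_seconds - clip_seconds
    return [(s, s + clip_seconds) for s in range(0, last_start + 1, clip_seconds)]


# Figures taken from the abstract: 8 hours of footage, 10-second clips,
# and "over 25,000" QA pairs (used here only as a lower bound).
footage_seconds = 8 * 3600
windows = clip_windows(footage_seconds, clip_seconds=10)
qa_pairs_lower_bound = 25_000

print(f"clips under these assumptions: {len(windows)}")
print(f"QA pairs per clip (lower bound): {qa_pairs_lower_bound / len(windows):.2f}")
```

Under these assumptions the footage yields 2,880 clips, which would put the annotation density at roughly 8.7 QA pairs per clip or more.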

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Human Pose and Action Recognition