STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes
Keishi Ishihara, Kento Sasaki, Tsubasa Takahashi, Daiki Shiono, Yu Yamaguchi

TL;DR
STRIDE-QA is a large-scale VQA dataset, built from 100 hours of multi-sensor driving data in Tokyo, for training and evaluating the spatiotemporal reasoning of vision-language models in urban driving scenes.
Contribution
This paper introduces STRIDE-QA, which the authors describe as the largest VQA dataset for spatiotemporal reasoning in urban driving (16M QA pairs over 270K frames). Its QA tasks require both spatial and temporal reasoning, and the paper reports that training on the dataset improves model performance.
Findings
Existing VLMs perform poorly on spatiotemporal tasks
Fine-tuning on STRIDE-QA significantly improves VLM performance
STRIDE-QA is intended to support the development of safer autonomous driving systems
Abstract
Vision-Language Models (VLMs) have been applied to autonomous driving to support decision-making in complex real-world scenarios. However, their training on static, web-sourced image-text pairs fundamentally limits the precise spatiotemporal reasoning required to understand and predict dynamic traffic scenes. We address this critical gap with STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded reasoning from an ego-centric perspective. Constructed from 100 hours of multi-sensor driving data in Tokyo, capturing diverse and challenging conditions, STRIDE-QA is the largest VQA dataset for spatiotemporal reasoning in urban driving, offering 16M QA pairs over 270K frames. Grounded by dense, automatically generated annotations including 3D bounding boxes, segmentation masks, and multi-object tracks, the dataset uniquely supports both object-centric and…
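As a rough sense of scale, the headline figures in the abstract can be turned into per-frame and per-hour densities. The sketch below uses only the three rounded numbers quoted above. The variable names are illustrative, and the per-second rate assumes the annotated frames are spread evenly over the 100 hours, which the abstract does not state.

```python
# Figures quoted in the abstract (rounded there, so results are approximate).
HOURS_OF_DRIVING = 100
NUM_FRAMES = 270_000
NUM_QA_PAIRS = 16_000_000

# Average number of QA pairs attached to each annotated frame.
qa_pairs_per_frame = NUM_QA_PAIRS / NUM_FRAMES

# Average annotated frames per hour and per second of driving.
# Assumption: frames are distributed evenly across the recording time.
frames_per_hour = NUM_FRAMES / HOURS_OF_DRIVING
frames_per_second = frames_per_hour / 3600

print(f"QA pairs per annotated frame: {qa_pairs_per_frame:.1f}")
print(f"Annotated frames per hour:    {frames_per_hour:.0f}")
print(f"Annotated frames per second:  {frames_per_second:.2f}")
```

Under these assumptions, each annotated frame carries on the order of 59 QA pairs, and annotated frames are sampled at a rate below one per second of driving.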
