STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes

Keishi Ishihara; Kento Sasaki; Tsubasa Takahashi; Daiki Shiono; Yu Yamaguchi

arXiv:2508.10427 · cs.CV · January 21, 2026

TL;DR

STRIDE-QA is a large-scale, multi-sensor dataset designed for spatiotemporal reasoning in urban driving scenes, enabling better training and evaluation of vision-language models for autonomous driving.

Contribution

This paper introduces STRIDE-QA, which the authors describe as the largest VQA dataset for spatiotemporal reasoning in urban driving. It defines QA tasks that require both spatial and temporal reasoning, and reports that training on the dataset improves model performance.

Findings

01 Existing VLMs perform poorly on spatiotemporal tasks

02 Fine-tuning on STRIDE-QA significantly improves VLM performance

03 STRIDE-QA is intended to support the development of safer autonomous driving systems

Abstract

Vision-Language Models (VLMs) have been applied to autonomous driving to support decision-making in complex real-world scenarios. However, their training on static, web-sourced image-text pairs fundamentally limits the precise spatiotemporal reasoning required to understand and predict dynamic traffic scenes. We address this critical gap with STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded reasoning from an ego-centric perspective. Constructed from 100 hours of multi-sensor driving data in Tokyo, capturing diverse and challenging conditions, STRIDE-QA is the largest VQA dataset for spatiotemporal reasoning in urban driving, offering 16M QA pairs over 270K frames. Grounded by dense, automatically generated annotations including 3D bounding boxes, segmentation masks, and multi-object tracks, the dataset uniquely supports both object-centric and…
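To put the headline figures in perspective, the following minimal sketch derives two rough densities from the numbers quoted in the abstract (16M QA pairs, 270K frames, 100 hours). The variable names are illustrative, the inputs are the rounded values from the abstract, and the frame-rate figure assumes the annotated frames are spread evenly over the 100 hours, which the abstract does not state.

```python
# Rounded figures quoted in the abstract (not exact counts).
QA_PAIRS = 16_000_000
FRAMES = 270_000
HOURS = 100

# Average number of QA pairs attached to each annotated frame.
qa_per_frame = QA_PAIRS / FRAMES

# Average number of annotated frames per hour of driving data.
frames_per_hour = FRAMES / HOURS

# Implied sampling rate, ASSUMING frames are evenly spread over the recording.
frames_per_second = frames_per_hour / 3600

print(f"QA pairs per frame: {qa_per_frame:.1f}")
print(f"Annotated frames per hour: {frames_per_hour:.0f}")
print(f"Implied frames per second: {frames_per_second:.2f}")
```

On these rounded inputs, each annotated frame carries roughly 59 QA pairs, and the annotated frames correspond to well under one frame per second of driving. Both values are approximations and should not be read as the dataset's actual sampling scheme.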
