TARDIS STRIDE: A Spatio-Temporal Road Image Dataset and World Model for Autonomy
Héctor Carrión, Yutong Bai, Víctor A. Hernández Castro, Kishan Panaganti, Ayush Zenith, Matthew Trang, Tony Zhang, Pietro Perona, Jitendra Malik

TL;DR
This paper introduces STRIDE, a comprehensive spatio-temporal road image dataset, and TARDIS, a transformer-based world model that effectively captures environment dynamics for autonomous agent tasks.
Contribution
It presents a novel dataset and a unified autoregressive transformer model for modeling complex spatio-temporal environment dynamics in autonomous systems.
Findings
Robust performance in image synthesis and instruction following.
State-of-the-art results in georeferencing tasks.
Demonstrated potential for generalist autonomous agents.
Abstract
World models aim to simulate environments and enable effective agent behavior. However, modeling real-world environments presents unique challenges as they dynamically change across both space and, crucially, time. To capture these composed dynamics, we introduce a Spatio-Temporal Road Image Dataset for Exploration (STRIDE) permuting 360-degree panoramic imagery into rich interconnected observation, state and action nodes. Leveraging this structure, we can simultaneously model the relationship between egocentric views, positional coordinates, and movement commands across both space and time. We benchmark this dataset via TARDIS, a transformer-based generative world model that integrates spatial and temporal dynamics through a unified autoregressive framework trained on STRIDE. We demonstrate robust performance across a range of agentic tasks such as controllable photorealistic image…
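The abstract describes trajectories as interconnected observation, state and action nodes modeled by a single autoregressive transformer. The following is a minimal sketch of that idea, not the authors' implementation: the `Node` class, the `interleave` and `next_node_pairs` helpers, the observation-state-action ordering, the action names, and all example values are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass(frozen=True)
class Node:
    """One element of a trajectory (hypothetical structure, not from the paper)."""
    kind: str     # "observation", "state", or "action"
    payload: Any  # e.g. an image id, a (lat, lon, year) tuple, or a command name


def interleave(steps: List[Tuple[Any, Any, Any]]) -> List[Node]:
    """Flatten per-step (observation, state, action) triples into one sequence.

    The observation -> state -> action ordering is an assumption.
    """
    seq: List[Node] = []
    for obs, state, action in steps:
        seq.append(Node("observation", obs))
        seq.append(Node("state", state))
        seq.append(Node("action", action))
    return seq


def next_node_pairs(seq: List[Node]) -> List[Tuple[List[Node], Node]]:
    """Build autoregressive pairs: predict node i from all nodes before it."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


# Illustrative values only: a step in space, then a step in time, then a turn.
steps = [
    ("pano_0", (34.000, -118.000, 2016), "move_forward"),
    ("pano_1", (34.001, -118.000, 2016), "advance_time"),
    ("pano_1_later", (34.001, -118.000, 2019), "turn_left"),
]
sequence = interleave(steps)
pairs = next_node_pairs(sequence)
```

Under this framing, one sequence model can be asked for different node types depending on where the context ends: predicting an observation resembles image synthesis, while predicting a state resembles georeferencing.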
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
