SceneStreamer: Continuous Scenario Generation as Next Token Group Prediction
Zhenghao Peng, Yuxin Liu, Bolei Zhou

TL;DR
SceneStreamer is a transformer-based framework that generates continuous, realistic traffic scenarios for autonomous driving simulation, supporting long-duration, dynamic agent interactions and improving policy robustness.
Contribution
It introduces a novel autoregressive token-based approach for long-horizon traffic scenario generation, enabling dynamic agent management and realistic behavior modeling.
Findings
Produces diverse, realistic traffic scenarios
Enhances robustness of autonomous driving policies trained in simulation
Supports unbounded, long-duration scenario generation
Abstract
Realistic and interactive traffic simulation is essential for training and evaluating autonomous driving systems. However, most existing data-driven simulation methods rely on static initialization or log-replay data, limiting their ability to model dynamic, long-horizon scenarios with evolving agent populations. We propose SceneStreamer, a unified autoregressive framework for continuous scenario generation that represents the entire scene as a sequence of tokens, including traffic light signals, agent states, and motion vectors, and generates them step by step with a transformer model. This design enables SceneStreamer to continuously introduce and retire agents over an unbounded horizon, supporting realistic long-duration simulation. Experiments demonstrate that SceneStreamer produces realistic, diverse, and adaptive traffic behaviors. Furthermore, reinforcement learning policies…
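As a rough illustration of the "next token group" idea in the abstract, the toy sketch below streams one group of tokens per category at every step: traffic light tokens, then agent-state tokens for any newly introduced agent, then one motion token per active agent. It is not the authors' implementation. The transformer is replaced by a uniform random sampler (`sample_next`), and the vocabularies, the within-step ordering, the spawn and retire probabilities, and all function names are assumptions made for illustration. Only the agent-state group layout ⟨SOA, TYPE, MS, RS⟩ is taken from the review text on this page.

```python
import random
from dataclasses import dataclass, field

# Hypothetical vocabularies; the real tokenization is not described on this page.
LIGHT_STATES = ["GREEN", "YELLOW", "RED"]
AGENT_TYPES = ["VEHICLE", "PEDESTRIAN", "CYCLIST"]
NUM_MAP_SEGMENTS = 8      # assumed number of map-segment IDs (MS)
NUM_REL_STATE_BINS = 16   # assumed bins for the relative state (RS)
NUM_MOTION_BINS = 32      # assumed motion-token vocabulary size


@dataclass
class Scene:
    tokens: list = field(default_factory=list)         # flat token sequence so far
    active_agents: list = field(default_factory=list)  # IDs of agents in the scene
    next_agent_id: int = 0


def sample_next(context, choices, rng):
    """Stand-in for a transformer head: ignores `context`, samples uniformly."""
    return rng.choice(choices)


def step(scene, t, rng, num_lights=2, spawn_prob=0.5, retire_prob=0.1, max_agents=5):
    # 1) Traffic light group: one token per light.
    for light_id in range(num_lights):
        state = sample_next(scene.tokens, LIGHT_STATES, rng)
        scene.tokens.append(("LIGHT", t, light_id, state))

    # 2) Agent-state group <SOA, TYPE, MS, RS>: optionally introduce one agent.
    if len(scene.active_agents) < max_agents and rng.random() < spawn_prob:
        agent_id = scene.next_agent_id
        scene.next_agent_id += 1
        scene.tokens.append(("SOA", t, agent_id))
        scene.tokens.append(("TYPE", t, agent_id, sample_next(scene.tokens, AGENT_TYPES, rng)))
        scene.tokens.append(("MS", t, agent_id, sample_next(scene.tokens, range(NUM_MAP_SEGMENTS), rng)))
        scene.tokens.append(("RS", t, agent_id, sample_next(scene.tokens, range(NUM_REL_STATE_BINS), rng)))
        scene.active_agents.append(agent_id)

    # 3) Motion group: one token per surviving agent; others are retired.
    #    (The retirement rule here is an assumption, not the paper's mechanism.)
    survivors = []
    for agent_id in scene.active_agents:
        if rng.random() < retire_prob:
            continue  # agent leaves the scene
        motion = sample_next(scene.tokens, range(NUM_MOTION_BINS), rng)
        scene.tokens.append(("MOTION", t, agent_id, motion))
        survivors.append(agent_id)
    scene.active_agents = survivors


def generate(num_steps, seed=0):
    """Roll the scene forward; nothing in the loop bounds the horizon."""
    rng = random.Random(seed)
    scene = Scene()
    for t in range(num_steps):
        step(scene, t, rng)
    return scene


if __name__ == "__main__":
    scene = generate(num_steps=5, seed=0)
    for token in scene.tokens:
        print(token)
```

Because every token is appended to a single growing sequence and the agent list is updated inside the loop, the same loop can in principle run for any number of steps, which is the property the abstract describes as an unbounded horizon. A real model would condition each sample on the accumulated context rather than ignoring it.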
Peer Reviews
Decision: ICLR 2026 Poster
Originality: The core idea of framing multi-agent, dynamic scenario generation as a unified next-token prediction task using a single autoregressive model (InfGen) is highly original in this domain. Specifically, the autoregressive generation of agent states (Type, Map ID, Relative State tokens: ⟨SOA,TYPE,MS,RS⟩_t) anchored to map segments is a clever mechanism for achieving physically and semantically consistent agent initialization, which is a major advancement over prior non-causal "flat" de
1. Limited Motion Prediction Benchmarking: While the core focus is scenario generation, comparing InfGen's motion prediction performance only against its own ablated version (InfGen-Motion vs. InfGen-Full) is insufficient. The motion prediction task (Sec 3.2) is standard, and performance should be compared against state-of-the-art motion prediction baselines on the Waymo Open Motion Dataset (WOMD) to properly contextualize the model's trajectory-modeling capability.
2. Lack of Diversity Metrics
● Unified formulation: Modeling the entire scene as a next-token sequence provides a unified autoregressive framework capturing spatiotemporal dependencies among maps, lights, and agents. This “traffic as language” design enhances long-horizon consistency, supports multiple tasks, and enables seamless dynamic scene evolution.
● Dynamic agent injection: The model can add or remove agents at different timesteps, breaking from the fixed-agent assumption and better reflecting open-world traffic wher
● Limited performance on core WOMD metrics: Despite its novel formulation, InfGen does not achieve competitive results on the core WOMD leaderboard—particularly on mADE, which is the primary metric of the Waymo Challenge. Its overall scores lag behind recent strong baselines such as UniMM and CAT-K, raising concerns about whether the proposed architectural contributions and dynamic scenario generation truly translate into better motion accuracy or downstream utility. The claimed advantage in sup
1. Closed-loop simulation
2. Unified modeling of the whole scenario
1. Confusingly positioned contribution and experimental setting
2. Lack of comprehensive comparison
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Traffic control and management · Traffic Prediction and Management Techniques
