TL;DR
ScenarioControl is a novel vision-language system for generating diverse, realistic 3D driving scenarios with fine-grained control over layout and traffic, supporting long-term, multi-view simulations.
Contribution
It introduces the first control mechanism for learned driving scenario generation that integrates multimodal inputs with a vectorized latent space.
Findings
Produces temporally consistent 3D scenarios from different viewpoints.
Achieves higher control fidelity and realism than existing methods.
Supports long-horizon scenario continuation.
Abstract
We introduce ScenarioControl, the first vision-language control mechanism for learned driving scenario generation. Given a text prompt or an input image, ScenarioControl synthesizes diverse, realistic 3D scenario rollouts, including the map, 3D boxes of reactive actors over time, pedestrians, driving infrastructure, and ego camera observations. The method generates scenes in a vectorized latent space that represents road structure and dynamic agents jointly. To connect multimodal control with sparse vectorized scene elements, we propose a cross-global control mechanism that integrates cross-attention with a lightweight global-context branch, enabling fine-grained control over road layout and traffic conditions while preserving realism. The method produces temporally consistent scenario rollouts from the perspectives of different actors in the scene, supporting long-horizon continuation of…
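The abstract names a cross-global control mechanism that combines cross-attention with a lightweight global-context branch but gives no architectural details. The following is only a minimal NumPy sketch of one way such a combination could look, not the authors' implementation. It assumes that each scene element (a sparse token for a road segment or an agent) attends to the control tokens from the prompt or image encoder, and that a mean-pooled summary of the control tokens is added to every element. The function names, the weight names `Wq`, `Wk`, `Wv`, `Wg`, the mean pooling, and the additive fusion are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(d, seed=0):
    # Hypothetical projection weights; a real model would learn these.
    rng = np.random.default_rng(seed)
    names = ("Wq", "Wk", "Wv", "Wg")
    return {n: rng.normal(scale=d ** -0.5, size=(d, d)) for n in names}

def cross_global_control(scene_tokens, control_tokens, params):
    """Fuse control tokens into sparse scene tokens (illustrative only).

    scene_tokens:   (N, d) latent vectors for map elements and agents
    control_tokens: (M, d) embeddings of the text prompt or input image
    """
    d = scene_tokens.shape[-1]

    # Cross-attention branch: each scene token queries the control tokens.
    q = scene_tokens @ params["Wq"]
    k = control_tokens @ params["Wk"]
    v = control_tokens @ params["Wv"]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (N, M)
    local = attn @ v                                # (N, d)

    # Global-context branch (assumed form): one pooled control vector
    # shared by, and added to, every scene token.
    global_ctx = control_tokens.mean(axis=0) @ params["Wg"]  # (d,)

    # Residual fusion of both branches.
    return scene_tokens + local + global_ctx, attn

rng = np.random.default_rng(1)
scene = rng.normal(size=(6, 16))     # 6 scene elements, 16-dim latents
control = rng.normal(size=(4, 16))   # 4 control tokens
params = init_params(16)
out, attn = cross_global_control(scene, control, params)
```

In this sketch the attention branch lets individual scene elements pick out the parts of the prompt that concern them, while the pooled branch carries scene-wide conditions (for example, overall traffic density) to every element at negligible cost.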
