TL;DR
Uni-World VLA introduces an interleaved world modeling and planning approach for autonomous driving, enabling continuous, adaptive decision-making by alternating between predicting future observations and planning actions.
Contribution
The paper proposes a unified vision-language-action (VLA) model that tightly couples world prediction and planning through step-by-step interleaving, with the aim of more adaptive decision-making in dynamic traffic scenarios.
Findings
Achieves competitive closed-loop planning performance on the NAVSIM benchmark.
Produces high-fidelity future frame predictions with integrated monocular depth cues.
Demonstrates the effectiveness of interleaved modeling and planning for autonomous driving.
Abstract
Autonomous driving requires reasoning about how the environment evolves and planning actions accordingly. Existing world-model-based approaches typically predict future scenes first and plan afterwards, resulting in open-loop imagination that may drift from the actual decision process. In this paper, we present Uni-World VLA, a unified vision-language-action (VLA) model that tightly interleaves future frame prediction and trajectory planning. Instead of generating a full world rollout before planning, our model alternates between predicting future frames and ego actions step by step, allowing planning decisions to be continuously conditioned on the imagined future observations. This interleaved generation forms a closed-loop interaction between world modeling and control, enabling more adaptive decision-making in dynamic traffic scenarios. In addition, we incorporate monocular depth…
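The difference between interleaved generation and a predict-then-plan pipeline can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names (`interleaved_rollout`, `predict_then_plan`, `toy_predict`, `toy_plan`) are hypothetical, frames and actions are reduced to single floats, and the assumption that each predicted frame also depends on the previously planned action is ours, not something the abstract states.

```python
from typing import Callable, List, Tuple

Frame = float   # toy stand-in for a predicted camera frame
Action = float  # toy stand-in for an ego action or waypoint

Predictor = Callable[[List[Frame], List[Action]], Frame]
Planner = Callable[[List[Frame], List[Action]], Action]


def interleaved_rollout(
    context: List[Frame], predict: Predictor, plan: Planner, horizon: int
) -> Tuple[List[Frame], List[Action]]:
    """Alternate frame prediction and action planning, one step at a time."""
    frames = list(context)
    actions: List[Action] = []
    for _ in range(horizon):
        # Imagine the next observation from everything generated so far.
        frames.append(predict(frames, actions))
        # Plan the next ego action conditioned on that imagined frame.
        actions.append(plan(frames, actions))
    return frames[len(context):], actions


def predict_then_plan(
    context: List[Frame], predict: Predictor, plan: Planner, horizon: int
) -> Tuple[List[Frame], List[Action]]:
    """Baseline for contrast: full frame rollout first, planning afterwards."""
    frames = list(context)
    for _ in range(horizon):
        # No planned actions exist yet, so the imagination cannot react to them.
        frames.append(predict(frames, []))
    actions: List[Action] = []
    for _ in range(horizon):
        actions.append(plan(frames, actions))
    return frames[len(context):], actions


def toy_predict(frames: List[Frame], actions: List[Action]) -> Frame:
    """Hypothetical dynamics: next frame = last frame + last action (if any)."""
    return frames[-1] + (actions[-1] if actions else 0.0)


def toy_plan(frames: List[Frame], actions: List[Action]) -> Action:
    """Hypothetical policy: steer the scalar 'frame' halfway toward zero."""
    return -0.5 * frames[-1]


if __name__ == "__main__":
    print(interleaved_rollout([4.0], toy_predict, toy_plan, horizon=3))
    print(predict_then_plan([4.0], toy_predict, toy_plan, horizon=3))
```

With these toy functions, the interleaved loop yields imagined frames that shrink as each planned action takes effect, whereas the baseline's frames stay fixed at the initial value because its rollout never sees a planned action. This is a small-scale analogue of the drift between imagination and decision process that the abstract attributes to predict-first approaches.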
