TL;DR
ExploreVLA introduces a unified framework combining world modeling and exploration for end-to-end autonomous driving, enhancing policy learning beyond imitation through dense supervision and intrinsic rewards.
Contribution
The paper proposes an approach that integrates world modeling with exploration signals, aiming to improve autonomous driving policies across diverse scenarios.
Findings
Achieved state-of-the-art scores on the NAVSIM benchmark.
Demonstrated effective exploration using image prediction uncertainty.
Enhanced visual and geometric representations for planning.
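To make the idea of exploring with image prediction uncertainty concrete, here is a minimal sketch of an uncertainty-based intrinsic reward. It is an illustration under stated assumptions, not the paper's implementation: it assumes the world model outputs a categorical distribution per predicted image token and uses mean token entropy as the uncertainty measure. The names `token_entropy`, `intrinsic_reward`, `total_reward`, and the weight `beta` are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def intrinsic_reward(token_probs):
    """Mean per-token entropy over a predicted future image.

    token_probs: one categorical distribution per predicted image token
    (assumed world-model output; the paper's actual measure may differ).
    """
    return sum(token_entropy(p) for p in token_probs) / len(token_probs)


def total_reward(extrinsic, token_probs, beta=0.1):
    """Extrinsic driving reward plus a weighted uncertainty bonus.

    beta is a hypothetical trade-off weight chosen for illustration.
    """
    return extrinsic + beta * intrinsic_reward(token_probs)


# A confidently predicted future yields no bonus; an uncertain one does.
confident = [[1.0, 0.0, 0.0, 0.0]] * 3
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3
```

With these inputs, a trajectory leading to the `uncertain` prediction receives a larger total reward than one leading to the `confident` prediction at the same extrinsic reward, which is what nudges the policy toward less familiar states.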
Abstract
End-to-end autonomous driving models based on Vision-Language-Action (VLA) architectures have shown promising results by learning driving policies through behavior cloning on expert demonstrations. However, imitation learning inherently limits the model to replicating observed behaviors without exploring diverse driving strategies, leaving it brittle in novel or out-of-distribution scenarios. Reinforcement learning (RL) offers a natural remedy by enabling policy exploration beyond the expert distribution. Yet VLA models, typically trained on offline datasets, lack directly observable state transitions, necessitating a learned world model to anticipate action consequences. In this work, we propose a unified understanding-and-generation framework that leverages world modeling to simultaneously enable meaningful exploration and provide dense supervision. Specifically, we augment trajectory…
