TL;DR
CoWorld-VLA introduces a multi-expert reasoning framework with explicit world representations to improve autonomous driving planning, demonstrating competitive results in scene generation and trajectory accuracy.
Contribution
CoWorld-VLA proposes a multi-expert world reasoning framework that encodes world information into explicit expert tokens and uses them to condition a diffusion-based planner for end-to-end autonomous driving.
Findings
Achieves strong performance in collision avoidance and trajectory accuracy.
Validates the effectiveness of expert tokens as planning conditions.
Demonstrates competitive results on the NAVSIM v1 benchmark.
Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution,…
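The idea of expert tokens acting as explicit, planner-accessible conditions can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the token counts, the embedding width `d_model`, the helper names `build_condition` and `cross_attend`, and the use of projection-free single-head cross-attention are all assumptions made for illustration. Only the three token types named in the abstract appear; the fourth is cut off in the text above and is left out rather than guessed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def build_condition(expert_tokens):
    """Stack per-expert token groups into one (N, d) conditioning sequence."""
    return np.concatenate(list(expert_tokens.values()), axis=0)

def cross_attend(queries, condition):
    """Single-head scaled dot-product cross-attention with no learned projections.

    queries:   (T, d) planner-side embeddings, e.g. noisy trajectory waypoints
    condition: (N, d) expert tokens used as keys and values
    """
    d = queries.shape[-1]
    weights = softmax(queries @ condition.T / np.sqrt(d), axis=-1)
    return weights @ condition, weights

rng = np.random.default_rng(0)
d_model = 16  # hypothetical embedding width

# Hypothetical expert tokens: 4 tokens per expert, random stand-ins for VLA outputs.
expert_tokens = {
    "semantic_interaction": rng.normal(size=(4, d_model)),
    "geometric_structure": rng.normal(size=(4, d_model)),
    "dynamic_evolution": rng.normal(size=(4, d_model)),
}

condition = build_condition(expert_tokens)       # shape (12, 16)
noisy_waypoints = rng.normal(size=(8, d_model))  # 8 hypothetical waypoint embeddings
context, weights = cross_attend(noisy_waypoints, condition)
```

Each row of `weights` sums to 1 and shows how strongly one waypoint attends to each expert token. In a diffusion-style planner, a context like this would typically be fed into the denoising network at every step, alongside the noisy trajectory and the timestep.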
