CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

Minqing Huang; Yujiao Xiang; Zihan Liang; Jiajie Huang; Jingqi Wang; Zhi Xu; Feiyang Tan; Hangning Zhou; Mu Yang; Gong Che

arXiv:2605.10426 · cs.CV · May 14, 2026

TL;DR

CoWorld-VLA introduces a multi-expert reasoning framework with explicit world representations to improve autonomous driving planning, demonstrating competitive results in scene generation and trajectory accuracy.

Contribution

The paper proposes a multi-expert world reasoning framework for end-to-end autonomous driving, in which explicit expert tokens serve as conditions for a diffusion-based planner.

Findings

01. Achieves strong performance in collision avoidance and trajectory accuracy.

02. Validates the effectiveness of expert tokens as planning conditions.

03. Demonstrates competitive results on the NAVSIM v1 benchmark.

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution,…
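
The idea of expert tokens acting as explicit conditions for a planner can be illustrated with a toy sketch. This is not the paper's architecture: the abstract does not say how the planner consumes the tokens, so the cross-attention conditioning, the token shapes, the function names, and the update rule below are all assumptions made for illustration. Only the three token types named before the abstract is cut off are included, and a simple interpolation step stands in for a learned diffusion denoiser.

```python
import math
import random

# Hypothetical sizes, chosen only to keep the example small.
FEATURE_DIM = 4
TOKENS_PER_EXPERT = 2

rng = random.Random(0)

def random_tokens(count, dim):
    """Stand-in for tokens a trained VLA would produce (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]

# Token types named in the abstract; the remaining type is elided there
# and is deliberately left out here.
expert_tokens = {
    "semantic_interaction": random_tokens(TOKENS_PER_EXPERT, FEATURE_DIM),
    "geometric_structure": random_tokens(TOKENS_PER_EXPERT, FEATURE_DIM),
    "dynamic_evolution": random_tokens(TOKENS_PER_EXPERT, FEATURE_DIM),
}

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def build_condition(tokens_by_expert):
    """Concatenate every expert's tokens into one conditioning sequence."""
    condition = []
    for name in sorted(tokens_by_expert):
        condition.extend(tokens_by_expert[name])
    return condition

def cross_attend(query, condition):
    """Scaled dot-product attention of one query over the condition tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in condition]
    weights = softmax(scores)
    context = [
        sum(w * tok[i] for w, tok in zip(weights, condition))
        for i in range(len(query))
    ]
    return context, weights

def denoise_step(noisy_waypoints, condition, step_size=0.1):
    """Toy stand-in for a learned denoiser: move each waypoint feature
    toward the context it attends to in the condition tokens."""
    updated = []
    for waypoint in noisy_waypoints:
        context, _ = cross_attend(waypoint, condition)
        updated.append(
            [x + step_size * (c - x) for x, c in zip(waypoint, context)]
        )
    return updated

condition = build_condition(expert_tokens)
noisy_plan = random_tokens(3, FEATURE_DIM)  # three noisy waypoint features
refined_plan = denoise_step(noisy_plan, condition)
```

The point of the sketch is the data flow: each expert contributes tokens to a shared conditioning sequence, and every planner update reads from that sequence rather than from text or from an opaque latent.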

Code & Models

Repositories

AFARI-Research/CoWorld-VLA (GitHub)