DriveMA: Rethinking Language Interfaces in Driving VLAs with One-Step Meta-Actions
Weicheng Zheng, Yixin Huang, Qiao Sun, Derun Li, Hang Zhao

TL;DR
DriveMA introduces a simplified language interface for driving vision-language-action models (Driving VLAs) based on one-step meta-actions, improving scalability and inference efficiency while achieving state-of-the-art performance on driving benchmarks.
Contribution
The paper proposes one-step meta-actions as an effective alternative to complex reasoning chains in Driving VLAs, enabling scalable supervision and improved performance.
Findings
DriveMA achieves a new state of the art on the Waymo End-to-End Driving Challenge.
One-step meta-actions outperform natural-language reasoning in efficiency and predictability.
Ablation studies confirm the practical advantages of meta-actions over finer-grained actions.
Abstract
Driving Vision-Language-Action Models (Driving VLAs) commonly introduce natural-language reasoning as an intermediate interface for end-to-end planning, but reasoning-centric interfaces face three practical bottlenecks: obtaining high-quality reasoning annotations is difficult, generating and understanding long reasoning chains is challenging for compact models, and inference latency is substantially increased. In this paper, we rethink the design of language interfaces in Driving VLAs and show that concise one-step meta-actions are a simple yet effective alternative to verbose reasoning. Meta-actions provide semantic decision grounding while remaining low-entropy and automatically derivable from expert trajectories, which enables scalable supervision and reliable trajectory conditioning. Building on this interface, we propose DriveMA, which combines action-centric supervised training…
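To make the idea of deriving meta-actions automatically from expert trajectories concrete, here is a minimal Python sketch. It is not the paper's labeling procedure: the label vocabulary ("accelerate", "turn_left", and so on), the function name `derive_meta_action`, the 0.5 s waypoint spacing, and both thresholds are assumptions chosen purely for illustration. The sketch assumes waypoints in an ego-centric frame with x pointing forward and y pointing left.

```python
import math


def derive_meta_action(trajectory, dt=0.5, speed_eps=0.5, heading_eps_deg=10.0):
    """Map an expert trajectory to a (longitudinal, lateral) meta-action pair.

    trajectory: list of (x, y) waypoints in metres, ego frame (x forward, y left).
    dt: assumed time between waypoints in seconds (hypothetical value).
    speed_eps, heading_eps_deg: hypothetical thresholds, not from the paper.
    """
    if len(trajectory) < 3:
        raise ValueError("need at least three waypoints")

    def segment(p, q):
        # Speed (m/s) and heading (rad) of the segment from p to q.
        dx, dy = q[0] - p[0], q[1] - p[1]
        return math.hypot(dx, dy) / dt, math.atan2(dy, dx)

    v_start, h_start = segment(trajectory[0], trajectory[1])
    v_end, h_end = segment(trajectory[-2], trajectory[-1])

    # Longitudinal label from the change in speed over the horizon.
    if v_start < speed_eps and v_end < speed_eps:
        lon = "stop"
    elif v_end - v_start > speed_eps:
        lon = "accelerate"
    elif v_start - v_end > speed_eps:
        lon = "decelerate"
    else:
        lon = "keep_speed"

    # Lateral label from the heading change, wrapped to [-180, 180) degrees.
    dh = (h_end - h_start + math.pi) % (2 * math.pi) - math.pi
    dh_deg = math.degrees(dh)
    if dh_deg > heading_eps_deg:
        lat = "turn_left"
    elif dh_deg < -heading_eps_deg:
        lat = "turn_right"
    else:
        lat = "go_straight"

    return lon, lat


straight = [(0, 0), (2, 0), (4, 0), (6, 0)]
speeding_up = [(0, 0), (1, 0), (3, 0), (6, 0)]
left_turn = [(0, 0), (2, 0), (3.5, 1.5), (3.5, 3.5)]

print(derive_meta_action(straight))     # ('keep_speed', 'go_straight')
print(derive_meta_action(speeding_up))  # ('accelerate', 'go_straight')
print(derive_meta_action(left_turn))    # ('keep_speed', 'turn_left')
```

Because each trajectory maps to one label from a small fixed set, the resulting supervision is low-entropy and needs no human annotation, which is the property the abstract highlights. A real pipeline would likely use a richer vocabulary and smoother kinematic estimates than this two-segment comparison.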
