DriveMA: Rethinking Language Interfaces in Driving VLAs with One-Step Meta-Actions

Weicheng Zheng; Yixin Huang; Qiao Sun; Derun Li; Hang Zhao

arXiv:2605.21273·cs.CV·May 22, 2026

TL;DR

DriveMA introduces a simplified language interface for driving vision-language-action models based on one-step meta-actions, improving scalability and inference efficiency while achieving state-of-the-art performance on driving benchmarks.

Contribution

The paper proposes one-step meta-actions as an effective alternative to complex reasoning chains in Driving VLAs, enabling scalable supervision and improved performance.

Findings

01

DriveMA achieves a new state of the art on the Waymo End-to-End Driving Challenge.

02

One-step meta-actions outperform natural-language reasoning in efficiency and predictability.

03

Ablation studies confirm the practical advantages of meta-actions over finer-grained actions.

Abstract

Driving Vision-Language-Action Models (Driving VLAs) commonly introduce natural-language reasoning as an intermediate interface for end-to-end planning, but reasoning-centric interfaces face three practical bottlenecks: obtaining high-quality reasoning annotations is difficult, generating and understanding long reasoning chains is challenging for compact models, and inference latency is substantially increased. In this paper, we rethink the design of language interfaces in Driving VLAs and show that concise one-step meta-actions are a simple yet effective alternative to verbose reasoning. Meta-actions provide semantic decision grounding while remaining low-entropy and automatically derivable from expert trajectories, which enables scalable supervision and reliable trajectory conditioning. Building on this interface, we propose DriveMA, which combines action-centric supervised training…
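To make "automatically derivable from expert trajectories" concrete, here is a minimal sketch of how a one-step meta-action label might be computed from a short expert trajectory with simple rules. It is not the paper's labeling procedure: the function name `derive_meta_action`, the label vocabulary ("turn left", "keep speed", and so on), the ego-frame convention, and all thresholds are assumptions made for illustration only.

```python
import math
from typing import List, Tuple

# Hypothetical thresholds; the paper's actual rules and vocabulary are not given here.
STOP_SPEED_MPS = 0.3       # final speed below this counts as a stop
SPEED_DELTA_MPS = 1.0      # speed change that counts as accelerating or decelerating
HEADING_DELTA_RAD = 0.15   # heading change that counts as a turn


def derive_meta_action(waypoints: List[Tuple[float, float]], dt: float) -> str:
    """Label an expert trajectory with one short meta-action string.

    Assumes waypoints are (x, y) positions in metres in the ego frame
    (x forward, y left), sampled every `dt` seconds.
    """
    if len(waypoints) < 3 or dt <= 0:
        raise ValueError("need at least 3 waypoints and a positive dt")

    def speed(p, q):
        return math.hypot(q[0] - p[0], q[1] - p[1]) / dt

    def heading(p, q):
        return math.atan2(q[1] - p[1], q[0] - p[0])

    v_start = speed(waypoints[0], waypoints[1])
    v_end = speed(waypoints[-2], waypoints[-1])
    if v_end < STOP_SPEED_MPS:
        return "stop"

    # Heading change between first and last segment, wrapped to [-pi, pi].
    dh = heading(waypoints[-2], waypoints[-1]) - heading(waypoints[0], waypoints[1])
    dh = math.atan2(math.sin(dh), math.cos(dh))

    if dh > HEADING_DELTA_RAD:
        lateral = "turn left"
    elif dh < -HEADING_DELTA_RAD:
        lateral = "turn right"
    else:
        lateral = "go straight"

    if v_end - v_start > SPEED_DELTA_MPS:
        longitudinal = "accelerate"
    elif v_start - v_end > SPEED_DELTA_MPS:
        longitudinal = "decelerate"
    else:
        longitudinal = "keep speed"

    return f"{lateral}, {longitudinal}"
```

Under these assumed rules, `derive_meta_action([(0, 0), (5, 0), (9, 3), (12, 7)], dt=0.5)` yields "turn left, keep speed". Because every label comes from a small fixed vocabulary and needs no human annotation, a rule of this kind shows why such supervision could be low-entropy and cheap to scale.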