Vega: Learning to Drive with Natural Language Instructions

Sicheng Zuo; Yuxuan Li; Wenzhao Zheng; Zheng Zhu; Jie Zhou; Jiwen Lu

arXiv:2603.25741 · cs.CV · March 31, 2026

TL;DR

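Vega is a unified vision-language-world-action model for personalized autonomous driving that follows diverse natural language instructions. It comes with InstructScene, a dataset of around 100,000 scenes annotated with driving instructions and trajectories, and combines autoregressive processing of visual and language inputs with diffusion-based generation of future predictions and trajectories.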

Contribution

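The paper introduces Vega, a model that integrates vision, language, world modeling, and action for instruction-based generation and planning, together with InstructScene, a driving dataset of around 100,000 scenes annotated with instructions and the corresponding trajectories.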

Findings

01 Vega achieves superior planning performance.

02 The model exhibits strong instruction-following abilities.

03 Extensive experiments validate the effectiveness of the approach.

Abstract

Vision-language-action models have reshaped autonomous driving by incorporating language into the decision-making process. However, most existing pipelines use the language modality only for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions and the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the…
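
To make the dataset description concrete, the following is a minimal sketch of how one instruction-annotated scene might be represented. The class name, field names, coordinate convention, and helper method are all hypothetical and chosen for illustration; the abstract states only that each scene carries a driving instruction and a corresponding trajectory, not how InstructScene is stored.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class InstructedScene:
    """Hypothetical record layout; not the actual InstructScene schema."""

    scene_id: str                           # assumed identifier for the scene
    instruction: str                        # natural language driving instruction
    trajectory: List[Tuple[float, float]]   # assumed (x, y) waypoints in metres

    def path_length(self) -> float:
        """Sum of straight-line distances between consecutive waypoints."""
        total = 0.0
        for (x0, y0), (x1, y1) in zip(self.trajectory, self.trajectory[1:]):
            total += math.hypot(x1 - x0, y1 - y0)
        return total


# Illustrative values only; these are not taken from the dataset.
example = InstructedScene(
    scene_id="scene-0001",
    instruction="Change to the left lane and keep a steady speed.",
    trajectory=[(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)],
)
print(example.instruction, example.path_length())
```

Keeping the instruction and the trajectory in the same record reflects the pairing the abstract describes, in which the trajectory is the one that corresponds to the given instruction.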
