Vega: Learning to Drive with Natural Language Instructions
Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu

TL;DR
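Vega is a unified vision-language-world-action model for personalized autonomous driving that follows diverse natural language instructions. The paper pairs it with InstructScene, a dataset of around 100,000 instruction-annotated scenes, and combines autoregressive processing of visual and language inputs with diffusion-based generation of future predictions and trajectories.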
Contribution
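The paper introduces Vega, a multimodal model that unifies vision, language, world modeling, and action for instruction-based generation and planning, along with InstructScene, a dataset of around 100,000 scenes annotated with driving instructions and the corresponding trajectories.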
Findings
Vega achieves superior planning performance.
The model exhibits strong instruction-following abilities.
Extensive experiments validate the effectiveness of the approach.
Abstract
Vision-language-action models have reshaped autonomous driving by incorporating language into the decision-making process. However, most existing pipelines use the language modality only for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions and the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the…
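The abstract describes InstructScene as scenes annotated with driving instructions and corresponding trajectories. As a rough illustration only, the sketch below shows one way such a record could be represented. The class name, field names, waypoint format, and example values are all hypothetical and are not taken from the paper, and it is an assumption here that a single scene can carry more than one instruction.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class InstructedScene:
    """Hypothetical record: one scene, one instruction, one trajectory."""
    scene_id: str
    instruction: str
    # Assumed format: (x, y) waypoints in metres; the paper may use another.
    trajectory: List[Tuple[float, float]]


def path_length(record: InstructedScene) -> float:
    """Sum of straight-line distances between consecutive waypoints."""
    points = record.trajectory
    return sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def group_by_scene(records: List[InstructedScene]) -> Dict[str, List[InstructedScene]]:
    """Collect all instruction/trajectory pairs that share a scene."""
    grouped: Dict[str, List[InstructedScene]] = {}
    for record in records:
        grouped.setdefault(record.scene_id, []).append(record)
    return grouped


# Made-up example data: the same scene under two different instructions.
records = [
    InstructedScene("scene-0001", "keep going straight", [(0.0, 0.0), (3.0, 4.0)]),
    InstructedScene("scene-0001", "pull over to the right", [(0.0, 0.0), (1.0, 0.0), (1.0, -1.0)]),
]
grouped = group_by_scene(records)
```

Grouping by scene makes the personalization idea concrete: under the assumption above, different instructions for the same visual context map to different target trajectories.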
