Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models
Angel Martinez-Sanchez, Parthib Roy, Ross Greer

TL;DR
This paper introduces doScenes, a real-world dataset linking free-form instructions to vehicle trajectories, and adapts an open-source vision-language model to improve instruction-conditioned autonomous driving in complex scenes.
Contribution
It presents the first real-world dataset for instruction-guided driving and demonstrates how integrating natural language prompts enhances trajectory planning robustness and accuracy.
Findings
Instruction conditioning reduces mean average displacement error (ADE) by 98.7%.
Well-phrased prompts improve trajectory alignment by up to 5.1%.
Reproducible baseline and evaluation scripts are provided.
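For readers unfamiliar with the metric behind these findings, ADE is conventionally the mean Euclidean distance between predicted and ground-truth waypoints at matching time steps. The sketch below illustrates that standard definition and how a relative reduction is computed; the function names and toy trajectories are hypothetical and are not taken from the paper's evaluation scripts.

```python
import math


def average_displacement_error(pred, gt):
    """Mean Euclidean distance between matched (x, y) waypoints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(baseline, conditioned):
    """Fractional improvement of `conditioned` over `baseline` (lower is better)."""
    return (baseline - conditioned) / baseline


# Toy data (hypothetical): a 10-step ground truth along the x-axis and a
# prediction offset laterally by a constant 0.5 m.
gt = [(float(t), 0.0) for t in range(1, 11)]
pred = [(float(t), 0.5) for t in range(1, 11)]

ade = average_displacement_error(pred, gt)
print(f"ADE = {ade:.3f} m")
```

Because ADE is an average over scenes as well as time steps, a small number of extreme failures can dominate the mean, so a very large relative reduction in mean ADE and a single-digit percentage improvement can both describe the same system.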
Abstract
Instruction-grounded driving, where passenger language guides trajectory planning, requires vehicles to understand intent before motion. However, most prior instruction-following planners rely on simulation or fixed command vocabularies, limiting real-world generalization. doScenes, the first real-world dataset linking free-form instructions (with referentiality) to nuScenes ground-truth motion, enables instruction-conditioned planning. In this work, we adapt OpenEMMA, an open-source MLLM-based end-to-end driving framework that ingests front-camera views and ego-state and outputs 10-step speed-curvature trajectories, to this setting. We present a reproducible instruction-conditioned baseline on doScenes and investigate the effects of human instruction prompts on predicted driving behavior. We integrate doScenes directives as passenger-style prompts within OpenEMMA's vision-language…
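To make the "10-step speed-curvature trajectories" concrete, the sketch below shows one common way such a sequence can be turned into (x, y) waypoints for comparison against ground-truth motion. It assumes a simple unicycle model, a 0.5 s time step, and a left-positive curvature convention; these choices and the function name are illustrative assumptions, not a description of OpenEMMA's actual integration code.

```python
import math


def rollout(speeds, curvatures, dt=0.5):
    """Integrate per-step speed (m/s) and curvature (1/m) into waypoints.

    Assumes a unicycle model starting at the origin with heading 0 rad;
    dt and the sign convention are assumptions made for this sketch.
    """
    if len(speeds) != len(curvatures):
        raise ValueError("speeds and curvatures must have the same length")
    x = y = heading = 0.0
    waypoints = []
    for v, k in zip(speeds, curvatures):
        heading += v * k * dt            # yaw rate = speed * curvature
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        waypoints.append((x, y))
    return waypoints


# Ten steps at a constant 2 m/s: straight ahead, then a gentle left turn.
straight = rollout([2.0] * 10, [0.0] * 10)
left_turn = rollout([2.0] * 10, [0.05] * 10)
print(straight[-1], left_turn[-1])
```

Predicting speed and curvature rather than raw coordinates keeps each output step physically interpretable, at the cost of errors accumulating through the integration.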
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Robotic Path Planning Algorithms · Multimodal Machine Learning Applications
