Goal-Based Vision-Language Driving

Santosh Patapati; Trisanth Srinivasan

arXiv:2507.23042·cs.CV·October 14, 2025

Open Access

TL;DR

NovaDrive is a real-time vision-language architecture for autonomous driving that fuses front-camera images, HD-map tiles, LiDAR depth, and textual waypoints, improving success rate, path efficiency, and collision rate without recurrent memory.

Contribution

NovaDrive introduces a single-branch vision-language model with two-stage cross-attention fusion and a smoothness loss that discourages abrupt steering and speed changes, which removes the need for recurrent memory.
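The page does not give the form of the smoothness loss. As a rough illustration only, one common way to discourage abrupt control changes is to penalize the squared difference between consecutive steering and speed commands. The function name `smoothness_loss` and the weights `w_steer` and `w_speed` below are hypothetical, and this sketch should not be read as the authors' formulation.

```python
def smoothness_loss(steering, speed, w_steer=1.0, w_speed=1.0):
    """Mean squared change between consecutive commands (assumed form).

    steering, speed: equal-length sequences of per-timestep commands.
    w_steer, w_speed: hypothetical weights, not taken from the paper.
    """
    if len(steering) != len(speed):
        raise ValueError("steering and speed must have the same length")
    if len(steering) < 2:
        return 0.0  # a single command has no change to penalize
    n = len(steering) - 1
    # Squared first differences, averaged over the n transitions.
    d_steer = sum((b - a) ** 2 for a, b in zip(steering, steering[1:])) / n
    d_speed = sum((b - a) ** 2 for a, b in zip(speed, speed[1:])) / n
    return w_steer * d_steer + w_speed * d_speed


# A gently varying trajectory scores lower than a jerky one.
smooth = smoothness_loss([0.0, 0.1, 0.2], [5.0, 5.1, 5.2])
jerky = smoothness_loss([0.0, 0.5, -0.5], [5.0, 8.0, 4.0])
print(smooth < jerky)
```

In training, a term like this would typically be added to the main prediction loss with a small coefficient, so that it shapes the output without dominating it.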

Findings

01. Achieves an 84% success rate on the MD-NEX Outdoor benchmark

02. Reduces collision rate from 2.6% to 1.2%

03. Improves path efficiency (SPL) to 0.66
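For readers unfamiliar with SPL, the sketch below computes it under the standard definition of Success weighted by Path Length: each successful episode contributes the ratio of the shortest path length to the length actually travelled (capped at 1), failed episodes contribute 0, and the result is averaged over all episodes. The function name `spl` and the tuple layout are illustrative choices, and whether the benchmark computes the metric in exactly this way is an assumption.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    episodes: iterable of (success, shortest_len, taken_len) tuples,
    where lengths are positive numbers in the same unit.
    """
    total = 0.0
    count = 0
    for success, shortest_len, taken_len in episodes:
        count += 1
        denom = max(taken_len, shortest_len)
        if success and denom > 0:
            # Ratio is 1.0 for an optimal path and shrinks with detours.
            total += shortest_len / denom
    if count == 0:
        raise ValueError("at least one episode is required")
    return total / count


# One success with a path twice the optimum, and one failure.
print(spl([(True, 10.0, 20.0), (False, 10.0, 10.0)]))  # 0.25
```

Because failures score 0, SPL can never exceed the success rate, which is why the two numbers are usually reported together.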

Abstract

Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a vision-language architecture that processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints in a single branch. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory. We fine-tune the top 15 layers of an 11B LLaMA-3.2 vision-language backbone, enabling real-time inference. On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84% (+4%), boosts path-efficiency (SPL) to 0.66 (+0.11), and reduces collision…
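The two-stage cross-attention described in the abstract can be illustrated with a minimal NumPy sketch. It assumes single-head scaled dot-product attention with no learned projections, residual connections after each stage, and image and depth patches concatenated into one context for the second stage. None of these details are confirmed by the abstract, and the names `cross_attention` and `two_stage_fusion` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Scaled dot-product attention of queries (n, d) over context (m, d)."""
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)  # (n, m)
    return weights @ context, weights


def two_stage_fusion(waypoint_tokens, map_tokens, image_tokens, depth_tokens):
    # Stage 1: align waypoint tokens with HD-map tokens.
    attended, _ = cross_attention(waypoint_tokens, map_tokens)
    aligned = waypoint_tokens + attended  # residual connection (assumed)
    # Stage 2: refine over image and depth patches (concatenation assumed).
    patches = np.concatenate([image_tokens, depth_tokens], axis=0)
    refined, weights = cross_attention(aligned, patches)
    return aligned + refined, weights


rng = np.random.default_rng(0)
fused, attn = two_stage_fusion(
    rng.normal(size=(4, 16)),   # 4 waypoint tokens
    rng.normal(size=(9, 16)),   # 9 HD-map tile tokens
    rng.normal(size=(20, 16)),  # 20 image patches
    rng.normal(size=(20, 16)),  # 20 depth patches
)
print(fused.shape, attn.shape)  # (4, 16) (4, 40)
```

Running the coarse map alignment first means the second stage queries the much larger set of image and depth patches with tokens that already carry route context, which is one plausible reading of "aligns, then refines".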

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Robotics and Sensor-Based Localization