Goal-Based Vision-Language Driving
Santosh Patapati, Trisanth Srinivasan

TL;DR
NovaDrive is a real-time vision-language architecture for autonomous driving that fuses front-camera images, HD-map tiles, LiDAR depth, and textual waypoints in a single branch, raising success rate and path efficiency and lowering collision rate without recurrent memory.
Contribution
It introduces a novel single-branch vision-language model with cross-attention fusion and a smoothness loss, eliminating recurrent memory and enhancing driving performance.
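As a rough illustration of what a smoothness loss can look like, the sketch below penalizes the squared change between consecutive steering and speed commands. The paper's exact formulation is not given in this summary, so the function name `smoothness_loss`, the squared-difference form, and the weights `w_steer` and `w_speed` are all assumptions made for illustration.

```python
def smoothness_loss(steering, speed, w_steer=1.0, w_speed=1.0):
    """Mean squared change between consecutive control commands.

    Hypothetical formulation: the summary only says the loss discourages
    abrupt steering and speed changes, not how it is computed.
    """
    if len(steering) != len(speed):
        raise ValueError("steering and speed must have the same length")
    if len(steering) < 2:
        return 0.0  # a single command has no step-to-step change

    steps = len(steering) - 1
    # Sum of squared first differences for each control channel.
    d_steer = sum((b - a) ** 2 for a, b in zip(steering, steering[1:]))
    d_speed = sum((b - a) ** 2 for a, b in zip(speed, speed[1:]))
    return (w_steer * d_steer + w_speed * d_speed) / steps


# A gradual turn versus an oscillating one, both at constant speed.
constant_speed = [10.0, 10.0, 10.0, 10.0]
smooth = smoothness_loss([0.0, 0.1, 0.2, 0.3], constant_speed)
jerky = smoothness_loss([0.0, 0.3, -0.3, 0.3], constant_speed)
print(smooth < jerky)
```

Added to the main driving objective, a term of this kind makes oscillating commands more expensive than gradual ones, which is one plausible way to get temporally consistent outputs from a model that keeps no recurrent state.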
Findings
Achieves an 84% success rate (+4%) on the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark
Reduces collision rate from 2.6% to 1.2%
Improves path efficiency (SPL) to 0.66 (+0.11)
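For readers unfamiliar with SPL (Success weighted by Path Length), the sketch below computes it using the standard embodied-navigation definition: each successful episode contributes the ratio of the shortest path length to the longer of the shortest and actual path lengths, and failures contribute zero. Whether the MD-NEX benchmark uses exactly this definition is an assumption, and the episode values are made up for illustration.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    Each episode is a tuple (success, shortest_len, actual_len).
    Standard definition; its use by the benchmark here is assumed.
    """
    total = 0.0
    count = 0
    for success, shortest_len, actual_len in episodes:
        count += 1
        denom = max(actual_len, shortest_len)
        if success and denom > 0:
            # 1.0 for an optimal route, less for a detour.
            total += shortest_len / denom
    return total / count if count else 0.0


# Illustrative episodes only, not data from the paper.
episodes = [
    (True, 100.0, 100.0),   # success on the optimal route
    (True, 100.0, 125.0),   # success with a 25% detour
    (False, 80.0, 40.0),    # failure contributes zero
    (True, 50.0, 100.0),    # success with a path twice as long
]
score = spl(episodes)
print(round(score, 3))
```

Because failures score zero and detours are discounted, SPL can never exceed the success rate, which is why a 0.66 SPL alongside an 84% success rate is a consistent pair of numbers.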
Abstract
Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a vision-language architecture that processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints in a single branch. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory. We fine-tune the top 15 layers of an 11B LLaMA-3.2 vision-language backbone, enabling real-time inference. On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84% (+4%), boosts path efficiency (SPL) to 0.66 (+0.11), and reduces collision…
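The two-stage fusion described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses single-head scaled dot-product attention with no learned projections, adds a residual connection, and concatenates image and depth patches into one key/value set for the second stage. The names `cross_attend` and `two_stage_fusion` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product cross-attention with a residual.

    Assumption: keys and values are the same vectors and no learned
    projection matrices are applied, to keep the sketch short.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        attended = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def two_stage_fusion(waypoint_tokens, map_tokens, image_tokens, depth_tokens):
    """Stage 1: waypoints attend to the HD map. Stage 2: refine on patches."""
    aligned = cross_attend(waypoint_tokens, map_tokens)
    # Assumption: image and depth patches share one key/value set.
    return cross_attend(aligned, image_tokens + depth_tokens)


# Toy 2-dimensional tokens, chosen so the result is easy to check by hand.
fused = two_stage_fusion(
    waypoint_tokens=[[0.0, 0.0]],
    map_tokens=[[1.0, 0.0], [1.0, 0.0]],
    image_tokens=[[0.0, 2.0]],
    depth_tokens=[[0.0, 2.0]],
)
print(fused)
```

In this toy case the waypoint token first picks up the map feature and then the patch feature, so each stage's contribution stays visible in the output. A real block would add learned query, key, and value projections, multiple heads, and normalization.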
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Robotics and Sensor-Based Localization
