Vision-Language Cross-Attention for Real-Time Autonomous Driving
Santosh Patapati, Trisanth Srinivasan, Murari Ambati

TL;DR
This paper introduces XYZ-Drive, a vision-language model for real-time autonomous driving that fuses image, map, and waypoint data with goal-centered cross-attention, reaching 95% success on the MD-NEX Outdoor-Driving benchmark and halving collisions.
Contribution
The paper presents a novel single-model approach with goal-centered cross-attention for integrating vision, map, and waypoint data in autonomous driving, improving efficiency and performance.
Findings
XYZ-Drive achieves a 95% success rate on the MD-NEX Outdoor-Driving benchmark.
Removing any modality (vision, waypoint, or map) reduces success, confirming that each one matters.
Query-based fusion outperforms simple concatenation.
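To make the contrast between query-based fusion and concatenation concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the token counts, the embedding size `d`, the random token values, and the function names `cross_attend` and `concat_fuse` are all illustrative assumptions, and the learned query/key/value projections a real layer would have are omitted for brevity.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, keys, values):
    """Scaled dot-product cross-attention (single head, no learned weights).

    queries: goal (waypoint) tokens, each a list of d floats
    keys, values: image and map patch tokens, each a list of d floats
    Returns (fused_tokens, attention_weights), one row per query.
    """
    d = len(queries[0])
    fused, weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        w = softmax(scores)  # how strongly this waypoint token attends to each patch
        weights.append(w)
        fused.append(
            [sum(w_j * v[i] for w_j, v in zip(w, values)) for i in range(len(values[0]))]
        )
    return fused, weights


def concat_fuse(waypoint_tokens, image_tokens, map_tokens):
    """Baseline: join all tokens along the sequence axis with no goal weighting."""
    return waypoint_tokens + image_tokens + map_tokens


# Hypothetical toy inputs: 6 image patches, 5 map patches, 2 waypoint tokens.
random.seed(0)
d = 4
image_tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]
map_tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]
waypoint_tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(2)]

patches = image_tokens + map_tokens
fused, weights = cross_attend(waypoint_tokens, patches, patches)
baseline = concat_fuse(waypoint_tokens, image_tokens, map_tokens)
```

In this sketch the waypoint tokens act as queries, so each fused token is a weighted average of image and map patches, with the weights showing which patches the goal singles out. The concatenation baseline passes every token along unweighted and leaves it to the downstream model to relate them.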
Abstract
Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle them separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25 m × 25 m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark, XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15% and halving collisions, all while significantly improving efficiency by using only a single branch. Sixteen ablations explain the gains. Removing any modality (vision, waypoint, map) drops success by…
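For readers unfamiliar with the SPL metric quoted in the abstract, the sketch below computes it under the commonly used definition: the mean over episodes of the success indicator times the shortest-path length divided by the larger of the path actually taken and the shortest path. Whether MD-NEX uses exactly this form is an assumption, and the function name `spl` and the episode values are hypothetical, not results from the paper.

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: iterable of (success, shortest_len, path_len) tuples, where
      success      is True if the agent reached the goal,
      shortest_len is the shortest-path length from start to goal,
      path_len     is the length of the path the agent actually drove.
    """
    total = 0.0
    count = 0
    for success, shortest_len, path_len in episodes:
        count += 1
        if success:
            # A failed episode contributes 0; a successful one is
            # discounted by how much longer than optimal the path was.
            total += shortest_len / max(path_len, shortest_len)
    if count == 0:
        raise ValueError("need at least one episode")
    return total / count


# Hypothetical episodes for illustration only (lengths in metres).
episodes = [
    (True, 10.0, 10.0),   # optimal path
    (True, 10.0, 12.5),   # success with a detour
    (False, 10.0, 8.0),   # failure
    (True, 20.0, 25.0),   # success with a detour
]
score = spl(episodes)
```

For these made-up episodes the per-episode terms are 1, 0.8, 0, and 0.8, so the score is 2.6 / 4 = 0.65. The success rate alone would be 75%, which shows how SPL additionally penalises inefficient routes.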
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications · Multimodal Machine Learning Applications
