Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning
Aodi Wu, Xubo Luo

TL;DR
This paper introduces a structured prompting framework with explicit spatial reasoning for Vision-Language Models, improving autonomous driving scene understanding accuracy on both clean and corrupted data.
Contribution
We develop a task-specific prompting framework with spatial reasoning and multi-view visual assembly, advancing VLM performance in autonomous driving tasks.
Findings
Achieved 70.87% accuracy on clean data
Achieved 72.85% accuracy on corrupted data
Structured prompts improve safety-critical VLM tasks
Abstract
This technical report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components. First, a Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts, eliminating interference across diverse question types. Second, task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to each task. Third, a visual assembly module composes multi-view images with object crops, magenta markers, and adaptive historical frames based on question requirements. Fourth, we configure model inference parameters (temperature, top-p,…
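The routing idea in the first component can be illustrated with a minimal sketch. The abstract does not say how the router classifies questions, so the keyword matching below is only a stand-in, and every name here (`KEYWORDS`, `EXPERT_PROMPTS`, `classify`, `build_prompt`) and every prompt string is hypothetical rather than taken from the authors' code. Only the four task families come from the abstract.

```python
# Hypothetical keyword cues per task family. The four families are the ones
# named in the abstract; the cues themselves are illustrative assumptions.
# Dict order matters: more specific families are checked first.
KEYWORDS = {
    "corruption": ("corruption", "corrupted", "noise", "blur"),
    "planning": ("should the ego", "safe action", "plan"),
    "prediction": ("will", "future", "next"),
    "perception": ("what is", "where is", "how many", "visible"),
}

# Hypothetical expert prompts. A real prompt would carry the coordinate
# system, spatial rules, and few-shot examples described in the abstract.
EXPERT_PROMPTS = {
    "perception": "You are a driving perception expert. Describe objects by camera view.",
    "prediction": "You are a motion prediction expert. Reason step by step about future motion.",
    "planning": "You are a driving planner. Weigh safety before choosing an action.",
    "corruption": "You are an image quality inspector. Identify the visual corruption type.",
}


def classify(question: str) -> str:
    """Return the first task family whose cue appears in the question."""
    q = question.lower()
    for task, cues in KEYWORDS.items():
        if any(cue in q for cue in cues):
            return task
    return "perception"  # assumed default when no cue matches


def build_prompt(question: str) -> tuple[str, str]:
    """Dispatch a question to its expert prompt and return (task, full prompt)."""
    task = classify(question)
    return task, f"{EXPERT_PROMPTS[task]}\n\nQuestion: {question}"


if __name__ == "__main__":
    task, prompt = build_prompt("Will the pedestrian cross the road?")
    print(task)
    print(prompt)
```

Keeping one prompt per task family means a perception question never sees planning rules or planning few-shot examples, which is the interference the abstract says the router is meant to remove. A production router would likely need something more robust than substring matching.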
