Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning

Aodi Wu; Xubo Luo

arXiv:2510.24152 · cs.CV · October 29, 2025

TL;DR

This paper introduces a structured prompting framework with spatial reasoning for Vision-Language Models, improving autonomous driving scene understanding accuracy on both clean and corrupted data.

Contribution

We develop a task-specific prompting framework with spatial reasoning and multi-view visual assembly, advancing VLM performance in autonomous driving tasks.

Findings

01. Achieved 70.87% accuracy on clean data

02. Achieved 72.85% accuracy on corrupted data

03. Structured prompts improve safety-critical VLM tasks

Abstract

This technical report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components. First, a Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts, eliminating interference across diverse question types. Second, task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to each task. Third, a visual assembly module composes multi-view images with object crops, magenta markers, and adaptive historical frames based on question requirements. Fourth, we configure model inference parameters (temperature, top-p,…
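To make the first component concrete, the following is a minimal sketch of the routing idea: classify a question into one of the four task families named in the abstract, then prepend the matching expert prompt. Everything specific here is an assumption for illustration. The abstract does not say how the router classifies questions, so simple keyword matching is swapped in, and the names `TASK_KEYWORDS`, `EXPERT_PROMPTS`, `classify_question`, and `build_prompt`, as well as the prompt wording, are hypothetical rather than taken from the report.

```python
# Hypothetical keyword lists. The report's actual router may classify
# questions differently (for example, with a model call); this is only a sketch.
# Order matters: earlier tasks take priority when several keywords match.
TASK_KEYWORDS = {
    "corruption": ("corrupt", "noise", "blur", "degraded"),
    "planning": ("should the ego", "safe action", "plan"),
    "prediction": ("will", "future", "next"),
    "perception": ("what", "where", "how many", "visible"),
}

# Placeholder expert prompts, one per task family. The wording is illustrative
# and is not the prompt text used in the report.
EXPERT_PROMPTS = {
    "corruption": "You are an image-quality inspector. Decide whether the camera views are corrupted.",
    "planning": "You are a driving planner. Reason step by step about safe ego-vehicle actions.",
    "prediction": "You are a motion forecaster. Reason step by step about how the objects will move.",
    "perception": "You are a perception expert. Describe objects relative to the ego vehicle.",
}


def classify_question(question: str) -> str:
    """Return the first task whose keywords appear in the question."""
    text = question.lower()
    for task, keywords in TASK_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return task
    # Fall back to perception when no keyword matches.
    return "perception"


def build_prompt(question: str) -> tuple[str, str]:
    """Dispatch the question to its expert prompt and return (task, full prompt)."""
    task = classify_question(question)
    return task, f"{EXPERT_PROMPTS[task]}\n\nQuestion: {question}"
```

Calling `build_prompt("Will the pedestrian cross the road?")` in this sketch selects the prediction prompt. Keeping one prompt per task is what lets each prompt carry its own rules and examples without interfering with the other question types. Substring matching is deliberately crude and would need a more robust classifier in practice.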
