Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts

Seungjun Yu; Junsung Park; Youngsun Lim; Hyunjung Shim

arXiv:2510.19001·cs.CV·October 23, 2025

Open Access

TL;DR

This paper introduces a two-phase vision-language question answering system for autonomous driving that leverages metadata and task-specific prompts to improve accuracy and robustness in high-level perception, prediction, and planning questions.

Contribution

It presents a two-phase approach that combines a large multimodal LLM with metadata-grounded, task-specific prompts and a self-consistency ensemble to improve driving QA performance.

Findings

1. Achieves 67.37% overall accuracy on a driving QA benchmark.
2. Maintains 96% accuracy under severe visual corruption.
3. Self-consistency ensemble improves answer reliability.

Abstract

We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large multimodal LLM (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt with few-shot exemplars. A self-consistency ensemble (multiple sampled reasoning chains) further improves answer reliability. In Phase-2, we augment the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions (separate prompts for perception, prediction, and planning tasks). In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. For example, using 5 history frames and 10-shot prompting in Phase-1 yields 65.1% overall accuracy (vs. 62.61% with zero-shot); applying…
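
The two prompting ideas described in the abstract, category-specific instructions combined with scene metadata and a self-consistency vote over several sampled answers, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the instruction strings, the metadata keys, and the `sample_answer` callable (standing in for one sampled reasoning chain from the model) are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical per-category instructions; the paper's actual prompt text
# is not reproduced here.
CATEGORY_INSTRUCTIONS: Dict[str, str] = {
    "perception": "Describe the relevant objects around the ego vehicle, then answer.",
    "prediction": "Reason about how the relevant objects are likely to move, then answer.",
    "planning": "Consider which ego-vehicle actions are safe in this scene, then answer.",
}


def build_prompt(category: str, metadata: Dict[str, str], question: str) -> str:
    """Compose a prompt from a category instruction, scene metadata, and the question."""
    instruction = CATEGORY_INSTRUCTIONS[category]
    # Metadata keys and values are illustrative placeholders for
    # annotations such as object lists or ego-vehicle state.
    context = "\n".join(f"- {key}: {value}" for key, value in metadata.items())
    return (
        f"{instruction}\n\n"
        f"Scene metadata:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer after light normalization."""
    normalized = [answer.strip().upper() for answer in answers]
    # On ties, Counter.most_common keeps the answer seen first.
    return Counter(normalized).most_common(1)[0][0]


def self_consistent_answer(
    sample_answer: Callable[[str], str], prompt: str, num_samples: int
) -> str:
    """Sample several answers for the same prompt and keep the majority answer.

    `sample_answer` is an assumed interface: each call should return the final
    answer extracted from one independently sampled reasoning chain.
    """
    answers = [sample_answer(prompt) for _ in range(num_samples)]
    return majority_vote(answers)
```

In use, `sample_answer` would wrap a model call with a non-zero sampling temperature so that the reasoning chains differ. The vote then filters out answers that only appear in a minority of chains.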

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques