Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models

Safaa Abdullahi Moallim Mohamud, Minjin Baek, and Dong Seog Han

arXiv:2506.02615·cs.CV·June 4, 2025

Open Access

TL;DR

This paper introduces a hierarchical question-answering method using vision-language models for efficient and detailed scene understanding in autonomous driving, balancing accuracy with low inference time.

Contribution

The paper proposes a hierarchical QA strategy that dynamically skips questions at inference time and fine-tunes a compact VLM on a custom, region-specific dataset to improve scene comprehension in autonomous vehicles.

Findings

01 Competitive performance with GPT-4o in scene detail capture
02 Significantly reduced inference time compared to state-of-the-art methods
03 Effective real-time deployment in driving scenarios

Abstract

In this paper, we present a hierarchical question-answering (QA) approach for scene understanding in autonomous vehicles, balancing cost-efficiency with detailed visual interpretation. The method fine-tunes a compact vision-language model (VLM) on a custom dataset specific to the geographical area in which the vehicle operates to capture key driving-related visual elements. At the inference stage, the hierarchical QA strategy decomposes the scene understanding task into high-level and detailed sub-questions. Instead of generating lengthy descriptions, the VLM navigates a structured question tree, where answering high-level questions (e.g., "Is it possible for the ego vehicle to turn left at the intersection?") triggers more detailed sub-questions (e.g., "Is there a vehicle approaching the intersection from the opposite direction?"). To optimize inference time, questions are dynamically…
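The question-tree traversal described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Question` class, the `trigger` field, the `ask_tree` function, and the `make_stub_vlm` stand-in are all hypothetical names, and the skipping rule (skip a subtree when the parent answer does not match its trigger) is an assumption, since the abstract is truncated before it explains how questions are skipped. Only the two question strings come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# The two example questions quoted in the abstract.
ROOT_Q = "Is it possible for the ego vehicle to turn left at the intersection?"
SUB_Q = "Is there a vehicle approaching the intersection from the opposite direction?"


@dataclass
class Question:
    """One node of a hypothetical question tree."""
    text: str
    # Assumption: sub-questions are asked only when the answer equals `trigger`.
    trigger: str = "yes"
    children: List["Question"] = field(default_factory=list)


def ask_tree(root: Question, answer_fn: Callable[[str], str]) -> Dict[str, str]:
    """Depth-first traversal that skips a subtree when the parent answer
    does not match the parent's trigger."""
    answers: Dict[str, str] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        answer = answer_fn(node.text).strip().lower()
        answers[node.text] = answer
        if answer == node.trigger:
            # Reverse so children are visited in their declared order.
            stack.extend(reversed(node.children))
    return answers


def make_stub_vlm(canned: Dict[str, str]) -> Callable[[str], str]:
    """Stand-in for a fine-tuned VLM call; returns canned answers,
    defaulting to "no" for unknown questions."""
    return lambda question: canned.get(question, "no")


tree = Question(ROOT_Q, children=[Question(SUB_Q)])

# Parent answered "no": the detailed sub-question is never asked.
skipped = ask_tree(tree, make_stub_vlm({ROOT_Q: "no"}))

# Parent answered "yes": the sub-question is triggered.
full = ask_tree(tree, make_stub_vlm({ROOT_Q: "yes", SUB_Q: "no"}))
```

In this sketch each skipped subtree saves one model call per node it contains, which is one plausible way a hierarchy could reduce inference time; in a real system `answer_fn` would wrap the VLM together with the current camera frame.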

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling