Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models
Safaa Abdullahi Moallim Mohamud, Minjin Baek, and Dong Seog Han

TL;DR
This paper introduces a hierarchical question-answering method using vision-language models for efficient and detailed scene understanding in autonomous driving, balancing accuracy with low inference time.
Contribution
It proposes a novel hierarchical QA strategy with dynamic question skipping and custom dataset fine-tuning for improved scene comprehension in autonomous vehicles.
Findings
Competitive performance with GPT-4o in scene detail capture
Significantly reduced inference time compared to state-of-the-art methods
Effective real-time deployment in driving scenarios
Abstract
In this paper, we present a hierarchical question-answering (QA) approach for scene understanding in autonomous vehicles, balancing cost-efficiency with detailed visual interpretation. The method fine-tunes a compact vision-language model (VLM) on a custom dataset specific to the geographical area in which the vehicle operates to capture key driving-related visual elements. At the inference stage, the hierarchical QA strategy decomposes the scene understanding task into high-level and detailed sub-questions. Instead of generating lengthy descriptions, the VLM navigates a structured question tree, where answering high-level questions (e.g., "Is it possible for the ego vehicle to turn left at the intersection?") triggers more detailed sub-questions (e.g., "Is there a vehicle approaching the intersection from the opposite direction?"). To optimize inference time, questions are dynamically…
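The question-tree traversal described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Question` class, the `traverse` function, the "yes" trigger rule, the pedestrian questions, and the `fake_vlm` stub are all assumptions made for illustration; only the two left-turn questions are quoted from the abstract. The abstract's description of how questions are dynamically handled is truncated, so the skipping shown here is just the simplest reading: a sub-question is asked only when its parent's answer matches a trigger.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Question:
    """One node of a (hypothetical) question tree."""
    text: str
    # Assumed rule: children are asked only if the answer equals `trigger`.
    trigger: str = "yes"
    children: List["Question"] = field(default_factory=list)


def traverse(roots: List[Question], ask: Callable[[str], str]) -> Dict[str, str]:
    """Depth-first walk that skips the subtree of any untriggered question."""
    answers: Dict[str, str] = {}
    stack = list(reversed(roots))  # reversed so roots are asked in listed order
    while stack:
        q = stack.pop()
        answer = ask(q.text).strip().lower()
        answers[q.text] = answer
        if answer == q.trigger:
            stack.extend(reversed(q.children))
    return answers


# The first two questions are quoted from the abstract; the rest are made up.
LEFT_TURN = "Is it possible for the ego vehicle to turn left at the intersection?"
ONCOMING = "Is there a vehicle approaching the intersection from the opposite direction?"
PEDESTRIAN = "Is there a pedestrian near the road?"
CROSSING = "Is the pedestrian crossing in front of the ego vehicle?"

roots = [
    Question(LEFT_TURN, children=[Question(ONCOMING)]),
    Question(PEDESTRIAN, children=[Question(CROSSING)]),
]


def fake_vlm(question: str) -> str:
    """Stand-in for a fine-tuned VLM call; returns canned answers."""
    canned = {LEFT_TURN: "yes", ONCOMING: "no", PEDESTRIAN: "no"}
    return canned.get(question, "no")


answers = traverse(roots, fake_vlm)
for text, answer in answers.items():
    print(f"{answer:>3}  {text}")
```

In this sketch the pedestrian question is answered "no", so its sub-question is never sent to the model. That saves one model call, which is the kind of inference-time reduction the hierarchical strategy aims for.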
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
