DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding
Mingzhe Tao, Ruiping Liu, Junwei Zheng, Yufan Chen, Kedi Ying, M. Saquib Sarfraz, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

TL;DR
DriveXQA introduces a multimodal dataset and a novel architecture for visual question answering in adverse driving conditions, enhancing autonomous vehicle scene understanding with multiple sensor modalities.
Contribution
The paper presents DriveXQA, a new multimodal dataset for autonomous driving VQA, and MVX-LLM, a novel token-efficient architecture with Dual Cross-Attention for multi-sensor data fusion.
Findings
MVX-LLM improves performance in foggy conditions
DriveXQA dataset includes diverse sensor failure and weather scenarios
Enhanced understanding of adverse driving scenes
Abstract
Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) are underexplored for leveraging multi-sensor information to understand adverse driving scenarios in autonomous vehicles. To address this gap, we propose DriveXQA, a multimodal dataset for autonomous driving VQA. In addition to four visual modalities, five sensor failure cases, and five weather conditions, it includes QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments…
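The abstract does not spell out how the DCA projector works, so the following is only a minimal sketch of generic single-head cross-attention fusion, not the authors' design. It illustrates why attention-based fusion can be token-efficient: the fused output keeps the token count of one modality instead of concatenating both. The function names, the modality labels (`rgb_tokens`, `aux_tokens`), the tensor sizes, the residual connection, and the absence of learned projection weights are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Scaled dot-product cross-attention (single head, no learned weights).

    queries: (n_q, d) tokens that ask for information
    context: (n_c, d) tokens that provide information
    returns: (n_q, d) context features aggregated per query token
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (n_q, n_c)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ context                    # (n_q, d)


def fuse_tokens(primary, auxiliary):
    """Hypothetical fusion step: primary tokens attend to auxiliary tokens,
    and the result is added back residually. Output keeps primary's length."""
    return primary + cross_attention(primary, auxiliary)


# Placeholder token grids for two visual modalities (sizes are arbitrary).
rng = np.random.default_rng(0)
rgb_tokens = rng.normal(size=(16, 32))
aux_tokens = rng.normal(size=(16, 32))

fused = fuse_tokens(rgb_tokens, aux_tokens)
print(fused.shape)  # 16 fused tokens, versus 32 if the two sets were concatenated
```

In this sketch the language model would receive 16 tokens rather than 32; how MVX-LLM actually arranges its two attention paths is described in the paper itself, not here.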
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
