DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding

Mingzhe Tao; Ruiping Liu; Junwei Zheng; Yufan Chen; Kedi Ying; M. Saquib Sarfraz; Kailun Yang; Jiaming Zhang; Rainer Stiefelhagen

arXiv:2603.11380 · cs.CV · March 26, 2026

TL;DR

DriveXQA introduces a multimodal dataset and a novel architecture for visual question answering in adverse driving conditions, enhancing autonomous vehicle scene understanding with multiple sensor modalities.

Contribution

The paper presents DriveXQA, a new multimodal dataset for autonomous driving VQA, and MVX-LLM, a novel token-efficient architecture with Dual Cross-Attention for multi-sensor data fusion.

Findings

01. MVX-LLM improves performance in foggy conditions

02. DriveXQA dataset includes diverse sensor failure and weather scenarios

03. Enhanced understanding of adverse driving scenes

Abstract

Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) are underexplored for leveraging multi-sensor information to understand adverse driving scenarios in autonomous vehicles. To address this gap, we propose DriveXQA, a multimodal dataset for autonomous driving VQA. It covers four visual modalities, five sensor failure cases, and five weather conditions, and includes 102,505 QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle-centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments…
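
To make the token-efficiency argument concrete, the following is a minimal sketch of cross-attention fusion in plain Python. It is not the paper's DCA projector, whose design is not described in this summary. The function names (`cross_attention`, `fuse`), the choice of one "primary" modality, the residual addition, and the toy token values are all assumptions made for illustration. The sketch only shows the general idea: if one modality's tokens attend over the tokens of the others, the fused output keeps the token count of a single modality instead of growing with the number of sensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product attention with no learned weights.

    Each query token returns a weighted average of the context tokens.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, c)) for c in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)]
        )
    return attended


def fuse(primary, auxiliaries):
    """Hypothetical fusion step (an assumption, not the paper's DCA projector).

    Primary tokens attend over all auxiliary tokens, and the result is added
    back residually, so the output has exactly len(primary) tokens.
    """
    context = [tok for modality in auxiliaries for tok in modality]
    attended = cross_attention(primary, context)
    return [[p + a for p, a in zip(pt, at)] for pt, at in zip(primary, attended)]


# Toy input: one primary modality and three auxiliary ones, two tokens each.
primary = [[1.0, 0.0], [0.0, 1.0]]
auxiliaries = [
    [[0.5, 0.5], [0.2, 0.1]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.3, 0.3], [0.9, 0.4]],
]

fused = fuse(primary, auxiliaries)
concatenated = primary + [tok for m in auxiliaries for tok in m]
print(len(concatenated), "tokens if concatenated")
print(len(fused), "tokens after fusion")
```

In this toy setup, naive concatenation passes every modality's tokens downstream, while the fused sequence is as long as the primary modality alone. A real projector would add learned query, key, and value projections and multiple heads.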

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning