V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

Junwei You; Pei Li; Zhuoyu Jiang; Weizhe Tang; Zilin Huang; Rui Gan; Jiaxi Liu; Yan Zhao; Sikai Chen; Bin Ran

arXiv:2604.02710·cs.RO·April 6, 2026

TL;DR

V2X-QA is a real-world dataset and benchmark for evaluating multimodal large language models in autonomous driving across ego (vehicle-side), infrastructure-side, and cooperative views, with a focus on diagnosing viewpoint-dependent reasoning.

Contribution

This work presents V2X-QA, a multi-view dataset and benchmark with a view-decoupled evaluation protocol for multimodal large language models in autonomous driving, together with V2X-MoE, a baseline model with explicit view routing.

Findings

01 Viewpoint accessibility significantly impacts model performance.

02 Infrastructure-side reasoning enhances traffic understanding.

03 Cooperative reasoning remains challenging due to cross-view alignment requirements.

Abstract

Multimodal large language models (MLLMs) have shown strong potential for autonomous driving, yet existing benchmarks remain largely ego-centric and therefore cannot systematically assess model performance in infrastructure-centric and cooperative driving conditions. In this work, we introduce V2X-QA, a real-world dataset and benchmark for evaluating MLLMs across vehicle-side, infrastructure-side, and cooperative viewpoints. V2X-QA is built around a view-decoupled evaluation protocol that enables controlled comparison under vehicle-only, infrastructure-only, and cooperative driving conditions within a unified multiple-choice question answering (MCQA) framework. The benchmark is organized into a twelve-task taxonomy spanning perception, prediction, and reasoning and planning, and is constructed through expert-verified MCQA annotation to enable fine-grained diagnosis of viewpoint-dependent…
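To make the view-decoupled MCQA protocol concrete, the sketch below scores multiple-choice predictions separately for each viewpoint and each task category. It is only an illustration: the record fields (`view`, `task`, `answer`, `prediction`), the helper `accuracy_by`, and the toy records are assumptions made here, not the dataset's actual schema or the authors' evaluation code.

```python
from collections import defaultdict

# Hypothetical record layout; the real V2X-QA schema may differ.
# Each record holds the viewpoint condition, the task category,
# the ground-truth option, and the option the model selected.
records = [
    {"view": "vehicle", "task": "perception", "answer": "A", "prediction": "A"},
    {"view": "vehicle", "task": "prediction", "answer": "C", "prediction": "B"},
    {"view": "infrastructure", "task": "perception", "answer": "B", "prediction": "B"},
    {"view": "cooperative", "task": "reasoning_and_planning", "answer": "D", "prediction": "D"},
    {"view": "cooperative", "task": "reasoning_and_planning", "answer": "A", "prediction": "C"},
]


def accuracy_by(records, key):
    """Return {group: accuracy}, grouping records by the value stored under `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["prediction"] == r["answer"])
    return {group: correct[group] / total[group] for group in total}


# Scoring each view on its own keeps the three conditions comparable.
per_view = accuracy_by(records, "view")
per_task = accuracy_by(records, "task")

for view, acc in sorted(per_view.items()):
    print(f"{view}: {acc:.2f}")
```

Reporting accuracy per view rather than as a single pooled number is what allows a gap between, say, vehicle-only and cooperative conditions to show up at all; the same grouping by task gives the finer-grained diagnosis the abstract describes.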

Code & Models

Repositories

junwei0001/V2X-QA
github
