RoadSceneVQA: Benchmarking Visual Question Answering in Roadside Perception Systems for Intelligent Transportation System

Runwei Guan; Rongsheng Hu; Shangshu Chen; Ningyuan Xiao; Xue Xia; Jiayang Liu; Beibei Chen; Ziren Tang; Ningwei Ouyang; Shaofeng Liang; Yuxuan Fan; Wanjie Sun; Yutao Yue

arXiv:2511.18286 · cs.CV · December 29, 2025

TL;DR

RoadSceneVQA introduces a large-scale dataset and novel reasoning modules for visual question answering in roadside traffic scenarios, enabling better interaction and understanding of traffic behaviors.

Contribution

The paper presents the RoadSceneVQA dataset, new fusion and reasoning modules, and a baseline model for improved traffic scene understanding and reasoning.

Findings

1. Enhanced reasoning accuracy with the proposed modules
2. State-of-the-art performance on traffic perception benchmarks
3. Improved computational efficiency in traffic scene reasoning

Abstract

Current roadside perception systems mainly focus on instance-level perception and fall short of enabling interaction via natural language or reasoning about traffic behaviors in context. To bridge this gap, we introduce RoadSceneVQA, a large-scale and richly annotated visual question answering (VQA) dataset specifically tailored for roadside scenarios. The dataset comprises 34,736 diverse QA pairs collected under varying weather, illumination, and traffic conditions, targeting not only object attributes but also the intent, legality, and interaction patterns of traffic participants. RoadSceneVQA challenges models to perform both explicit recognition and implicit commonsense reasoning, grounded in real-world traffic rules and contextual dependencies. To fully exploit the reasoning potential of Multi-modal Large Language Models (MLLMs), we further propose CogniAnchor Fusion (CAF), a…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning