V2X-UniPool: Unifying Multimodal Perception and Knowledge Reasoning for Autonomous Driving
Xuewen Luo, Fengze Yang, Fan Ding, Xiangbo Gao, Shuo Xing, Yang Zhou, Zhengzhong Tu, and Chenxi Liu

TL;DR
V2X-UniPool is a novel framework that integrates multimodal V2X perception with language-based reasoning, enhancing autonomous driving safety and efficiency by unifying sensor data and knowledge reasoning.
Contribution
It is the first to unify V2X perception with language reasoning, transforming multimodal data into structured knowledge for improved decision-making in autonomous driving.
Findings
Achieves state-of-the-art planning accuracy and safety.
Reduces communication cost by over 80%.
Maintains low overhead compared to other methods.
Abstract
Autonomous driving (AD) has achieved significant progress, yet single-vehicle perception remains constrained by sensing range and occlusions. Vehicle-to-Everything (V2X) communication addresses these limits by enabling collaboration across vehicles and infrastructure, but it also faces heterogeneity, synchronization, and latency constraints. Language models offer strong knowledge-driven reasoning and decision-making capabilities, but they are not inherently designed to process raw sensor streams and are prone to hallucination. We propose V2X-UniPool, the first framework that unifies V2X perception with language-based reasoning for knowledge-driven AD. It transforms multimodal V2X data into structured, language-based knowledge, organizes it in a time-indexed knowledge pool for temporally consistent reasoning, and employs Retrieval-Augmented Generation (RAG) to ground decisions in…
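As a rough illustration of the time-indexed knowledge pool and retrieval step the abstract describes, the following minimal Python sketch stores language-based facts with timestamps and retrieves those relevant to the current planning step. Everything in it is an assumption made for illustration: the `KnowledgePool` class, its method names, the example facts, and the prompt wording are hypothetical and are not taken from the paper, and a simple time-window lookup stands in for whatever retrieval mechanism the authors actually use.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field


@dataclass
class KnowledgePool:
    """Hypothetical time-indexed pool of language-based scene facts."""
    static_facts: list[str] = field(default_factory=list)  # slow-changing, e.g. lane layout
    _times: list[float] = field(default_factory=list)      # sorted timestamps (seconds)
    _facts: list[str] = field(default_factory=list)        # dynamic facts aligned with _times

    def add_static(self, fact: str) -> None:
        self.static_facts.append(fact)

    def add_dynamic(self, t: float, fact: str) -> None:
        # Insert while keeping the timestamp list sorted.
        i = bisect_right(self._times, t)
        self._times.insert(i, t)
        self._facts.insert(i, fact)

    def retrieve(self, t_now: float, window: float) -> list[str]:
        # Return all static facts plus dynamic facts with
        # t_now - window <= t <= t_now, oldest first.
        lo = bisect_left(self._times, t_now - window)
        hi = bisect_right(self._times, t_now)
        return self.static_facts + self._facts[lo:hi]


def build_prompt(pool: KnowledgePool, t_now: float, window: float = 1.0) -> str:
    # Ground the planner's prompt in retrieved facts only (assumed prompt format).
    context = "\n".join(f"- {fact}" for fact in pool.retrieve(t_now, window))
    return f"Scene knowledge at t={t_now:.1f}s:\n{context}\nPlan the ego trajectory."


pool = KnowledgePool()
pool.add_static("Intersection has two northbound lanes.")
pool.add_dynamic(0.5, "Pedestrian waiting at the east crosswalk.")
pool.add_dynamic(2.0, "Truck occludes the left-turn lane.")
pool.add_dynamic(1.8, "Cyclist approaching from the south.")
recent = pool.retrieve(t_now=2.0, window=0.5)
```

Keeping the dynamic facts sorted by timestamp means each retrieval is a pair of binary searches, and only short text strings, rather than raw sensor data, would need to reach the vehicle under this design.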
Peer Reviews
Decision: Submitted to ICLR 2026
- The concept of a centralized knowledge pool is innovative and well-motivated. It effectively abstracts away sensor heterogeneity and provides an interpretable interface for vehicles.
- The paper provides thorough experiments and ablation studies. The reported performance is strong.
- The methodology is described in detail with clear components for knowledge translation, pool construction (static/dynamic), and RAG-based retrieval.
- The paper's core motivation is that single-vehicle perception is limited and occluded, leading drivers/AI to blindly trust incomplete information and cause accidents. However, the proposed solution replaces one form of blind trust with another. It shifts trust from the vehicle's own sensors to the centralized V2X system. For example, a person can maliciously place a fake "construction ahead" sign that could fool the RSU's vision model, polluting the entire Knowledge Pool. This would cause all c
1. The proposed framework, V2X-UniPool, using natural language as the only form to transmit information from the infrastructure to vehicles, significantly reduces the communication cost.
2. The design of Static/Dynamic, High Frequency/Low Frequency pool systematically organizes the information stored in the infrastructure.
### Major Weaknesses
1. **The performance gain compared with the previous SoTA (V2X-VLM) is not significant.**
   - Compared with the full configuration of V2X-VLM, the L2 error is much higher. Notably, while V2X intuitively should provide more performance gain in long-term planning (just as the authors stated, V2X helps extend perception range), V2X-UniPool shows an even larger gap on 3.5s and 4.5s error.
   - Compared with the no-fusion result from V2X-VLM, which achieves an average L2 of 1.49m, the 1.4
* V2X-UniPool bridges perception-centric V2X with language-centric planning via a time-indexed knowledge pool and RAG grounding, explicitly tackling heterogeneity, temporal desynchronization, and hallucination.
* The static/dynamic pool provides multi-resolution temporal semantics, and ablations across model architectures show consistent gains.
* The latency and network assumptions may be optimistic. There is no robustness study for lower bandwidth, variable latency, or packet loss.
* Results are reported only on DAIR-V2X-Seq and in open-loop evaluation. The paper notes that end performance still depends on the chosen vehicle-side AD model and that the authors plan to pursue closed-loop validation, so real-world robustness and controller interaction effects remain untested.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
