End-to-End 3D Spatiotemporal Perception with Multimodal Fusion and V2X Collaboration
Zhenwei Yang, Yibo Ai, Weidong Zhang

TL;DR
This paper introduces XET-V2X, an end-to-end framework that fuses multi-view multimodal data and enables V2X collaboration for robust 3D perception in autonomous driving, effectively handling occlusions and communication delays.
Contribution
The paper proposes a multimodal fusion and V2X collaboration framework whose dual-layer spatial cross-attention module efficiently aligns multiple views and modalities.
Findings
Improves detection and tracking accuracy under communication delays.
Achieves robust perception in complex traffic scenarios.
Demonstrates effectiveness on real-world and simulated datasets.
Abstract
Multi-view cooperative perception and multimodal fusion are essential for reliable 3D spatiotemporal understanding in autonomous driving, especially under occlusions, limited viewpoints, and communication delays in V2X scenarios. This paper proposes XET-V2X, a multimodal fused end-to-end tracking framework for V2X collaboration that unifies multi-view multimodal sensing within a shared spatiotemporal representation. To efficiently align heterogeneous viewpoints and modalities, XET-V2X introduces a dual-layer spatial cross-attention module based on multi-scale deformable attention. Multi-view image features are first aggregated to enhance semantic consistency; point cloud features are then fused under the guidance of the updated spatial queries, enabling effective cross-modal interaction while reducing computational overhead. Experiments on the real-world V2X-Seq-SPD dataset and the simulated…
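The two-stage ordering described in the abstract (image aggregation first, then point cloud fusion driven by the updated queries) can be illustrated with a minimal sketch. This is not the authors' implementation: it substitutes plain dense scaled dot-product attention for the multi-scale deformable attention the paper uses, omits learned projections and positional encodings, and all names (`cross_attention`, `dual_layer_fusion`) and tensor shapes are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features):
    # ASSUMPTION: dense scaled dot-product attention with a residual update,
    # standing in for the paper's multi-scale deformable attention.
    d = queries.shape[-1]
    weights = softmax(queries @ features.T / np.sqrt(d), axis=-1)
    return queries + weights @ features

def dual_layer_fusion(queries, image_views, point_features):
    # Layer 1: aggregate image features from all camera views.
    image_tokens = np.concatenate(image_views, axis=0)
    queries = cross_attention(queries, image_tokens)
    # Layer 2: fuse point cloud features, guided by the updated queries.
    return cross_attention(queries, point_features)

# Illustrative shapes only: 4 spatial queries, 3 views of 6 tokens each,
# 10 point cloud tokens, feature dimension 8.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
image_views = [rng.normal(size=(6, 8)) for _ in range(3)]
point_features = rng.normal(size=(10, 8))

fused = dual_layer_fusion(queries, image_views, point_features)
print(fused.shape)  # (4, 8)
```

The point of the sketch is the ordering: the queries that attend to the point cloud have already absorbed image semantics, so the second layer is conditioned on the first rather than run independently in parallel.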
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications · Video Surveillance and Tracking Methods
