End-to-End 3D Spatiotemporal Perception with Multimodal Fusion and V2X Collaboration

Zhenwei Yang; Yibo Ai; Weidong Zhang

arXiv:2512.21831·cs.CV·December 29, 2025

TL;DR

This paper introduces XET-V2X, an end-to-end framework that fuses multi-view multimodal data and enables V2X collaboration for robust 3D perception in autonomous driving, effectively handling occlusions and communication delays.

Contribution

It proposes a novel multi-modal fusion and V2X collaboration framework with a dual-layer spatial cross-attention module for efficient multi-view and multi-modal alignment.

Findings

1. Improves detection and tracking accuracy under communication delays.

2. Achieves robust perception in complex traffic scenarios.

3. Demonstrates effectiveness on real-world and simulated datasets.

Abstract

Multi-view cooperative perception and multimodal fusion are essential for reliable 3D spatiotemporal understanding in autonomous driving, especially under occlusions, limited viewpoints, and communication delays in V2X scenarios. This paper proposes XET-V2X, a multimodal fused end-to-end tracking framework for V2X collaboration that unifies multi-view multimodal sensing within a shared spatiotemporal representation. To efficiently align heterogeneous viewpoints and modalities, XET-V2X introduces a dual-layer spatial cross-attention module based on multi-scale deformable attention. Multi-view image features are first aggregated to enhance semantic consistency, followed by point cloud fusion guided by the updated spatial queries, enabling effective cross-modal interaction while reducing computational overhead. Experiments on the real-world V2X-Seq-SPD dataset and the simulated…
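
The two-stage ordering described in the abstract (image features first, then point cloud features guided by the updated queries) can be illustrated with a minimal sketch. This is not the authors' implementation: it substitutes plain scaled dot-product cross-attention for the multi-scale deformable attention named in the abstract, and every name, shape, and the residual update (`cross_attention`, `dual_layer_fusion`, the feature dimension of 8) is a hypothetical choice made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features):
    # Plain scaled dot-product cross-attention with a residual update.
    # ASSUMPTION: stands in for the deformable attention used in the paper.
    d = queries.shape[-1]
    weights = softmax(queries @ features.T / np.sqrt(d), axis=-1)
    return queries + weights @ features

def dual_layer_fusion(queries, image_feats_per_view, point_feats):
    # Layer 1: aggregate image features from all views into the spatial queries.
    image_feats = np.concatenate(image_feats_per_view, axis=0)
    queries = cross_attention(queries, image_feats)
    # Layer 2: the updated queries then attend to point cloud features.
    return cross_attention(queries, point_feats)

# Hypothetical toy inputs: 4 queries, 3 camera views, 10 point features, dim 8.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
views = [rng.normal(size=(6, 8)) for _ in range(3)]
points = rng.normal(size=(10, 8))

fused = dual_layer_fusion(queries, views, points)
print(fused.shape)  # (4, 8)
```

The point of the sketch is only the ordering: the queries that reach the point cloud stage already carry image semantics, which is the sense in which the second layer is "guided by the updated spatial queries".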

Taxonomy

Topics: Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications · Video Surveillance and Tracking Methods