A BEV-Fusion Based Framework for Sequential Multi-Modal Beam Prediction in mmWave Systems
Jiaming Zeng, Cunhua Pan, Haoyang Weng, Ruijing Liu, Hong Ren, and Jiangzhou Wang

TL;DR
This paper introduces a BEV-Fusion framework that fuses camera, LiDAR, radar, and GPS data in a shared bird's-eye-view (BEV) space to predict beams in mmWave vehicular systems, reducing beam-training overhead.
Contribution
It proposes a BEV-space fusion method that pairs a learned, cross-attention camera-to-BEV module with a temporal transformer for motion-aware beam prediction, and it outperforms the TransFuser baseline.
Findings
Achieves approximately 87% distance-based accuracy on the DeepSense 6G benchmark scenarios.
Outperforms the TransFuser baseline in multi-modal beam prediction.
Demonstrates the effectiveness of BEV-space fusion for sensing-assisted beam prediction.
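The distance-based accuracy figure in the findings can be made concrete with a small sketch. It assumes the metric follows the DBA score used for the DeepSense 6G multi-modal beam prediction challenge: for K = 1, 2, 3, the best of the top-K predicted beam indices is penalized by its index distance to the ground-truth beam, normalized by a constant (5 in the challenge) and clipped at 1, and the three per-K scores are averaged. This summary does not state the exact definition, so the function name, argument names, and default values below are assumptions.

```python
def dba_score(top_k_preds, labels, delta=5, max_k=3):
    """Distance-based accuracy (DBA), assuming the DeepSense 6G challenge form.

    top_k_preds: one list per sample, beam indices ordered best-first.
    labels:      ground-truth beam index per sample.
    delta:       normalization constant for the index distance (assumed 5).
    max_k:       scores are averaged over K = 1..max_k (assumed 3).
    """
    if len(top_k_preds) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    n = len(labels)
    per_k_scores = []
    for k in range(1, max_k + 1):
        penalty = 0.0
        for preds, truth in zip(top_k_preds, labels):
            # Best (smallest) clipped, normalized distance among the top-k beams.
            penalty += min(min(abs(p - truth) / delta, 1.0) for p in preds[:k])
        per_k_scores.append(1.0 - penalty / n)
    return sum(per_k_scores) / max_k


# One sample: the top-1 guess is two beams off, the top-2 guess is exact.
example = dba_score([[12, 10, 30]], [10])
```

Unlike plain top-1 accuracy, this score gives partial credit to a prediction that lands on a neighboring beam, which is why a near miss in the example above is penalized only mildly.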
Abstract
Beam prediction is critical for reducing beam-training overhead in millimeter-wave (mmWave) systems, especially in high-mobility vehicular scenarios. This paper presents a BEV-Fusion based framework that unifies camera, LiDAR, radar, and GPS modalities in a shared bird's-eye-view (BEV) representation for spatially consistent multi-modal fusion. Unlike prior approaches that fuse globally pooled one-dimensional features, the proposed method performs fusion in BEV space to preserve cross-modal geometric structure and visual semantic density. A learned camera-to-BEV module based on cross-attention is adopted to generate BEV-aligned visual features without relying on precise camera calibration, and a temporal transformer is used to aggregate five-step sequential observations for motion-aware beam prediction. Experiments on the DeepSense 6G benchmark show that BEV-Fusion achieves approximately…
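The two mechanisms named in the abstract, a cross-attention camera-to-BEV lift and temporal aggregation over five steps, can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the grid size, feature width, number of image patches, single-head attention, and random stand-ins for learned weights are all assumptions chosen for illustration, and one plain self-attention pass stands in for the temporal transformer.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention; returns outputs and weights."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights


# Hypothetical sizes; none of these values come from the paper.
rng = np.random.default_rng(0)
D = 16                # feature width
BEV_H, BEV_W = 8, 8   # BEV grid resolution
NUM_PATCHES = 40      # image tokens per camera frame
T = 5                 # five-step observation window, as in the abstract

# Random stand-ins for parameters that would be learned in a real model.
bev_queries = rng.normal(size=(BEV_H * BEV_W, D))  # one query per BEV cell
W_k = rng.normal(size=(D, D)) / np.sqrt(D)
W_v = rng.normal(size=(D, D)) / np.sqrt(D)


def camera_to_bev(image_tokens):
    """Each BEV cell query attends over all image tokens, so no explicit
    camera calibration matrix is used in the mapping."""
    bev, weights = cross_attention(bev_queries, image_tokens @ W_k, image_tokens @ W_v)
    return bev.reshape(BEV_H, BEV_W, D), weights


# Lift every frame in the window to BEV space: shape (T, BEV_H, BEV_W, D).
frames = rng.normal(size=(T, NUM_PATCHES, D))
bev_seq = np.stack([camera_to_bev(f)[0] for f in frames])

# Temporal aggregation stand-in: pool each BEV map to one token, then let the
# T tokens attend to each other and keep the most recent step's output.
step_tokens = bev_seq.mean(axis=(1, 2))                        # (T, D)
fused_steps, temporal_weights = cross_attention(step_tokens, step_tokens, step_tokens)
motion_feature = fused_steps[-1]                               # (D,)
```

In a full model, the BEV-aligned camera features would be fused with LiDAR, radar, and GPS features on the same grid before temporal aggregation, and a classification head would map the final feature to beam indices; those parts are omitted here.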
