A BEV-Fusion Based Framework for Sequential Multi-Modal Beam Prediction in mmWave Systems

Jiaming Zeng; Cunhua Pan; Haoyang Weng; Ruijing Liu; Hong Ren; and Jiangzhou Wang

arXiv:2604.05668·eess.SP·April 8, 2026

TL;DR

This paper introduces a BEV-Fusion framework that fuses camera, LiDAR, radar, and GPS data in a shared bird's-eye-view (BEV) space to improve beam prediction in mmWave vehicular systems and reduce beam-training overhead.

Contribution

It proposes a novel BEV-based fusion method with a learned camera-to-BEV module and temporal transformer for motion-aware beam prediction, outperforming prior approaches.
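To make the camera-to-BEV idea concrete, the following is a minimal NumPy sketch of single-head cross-attention in which one learned query per BEV cell attends over flattened image features. It is not the authors' implementation: the function name `camera_to_bev`, the grid size, the feature dimensions, and the random weights are all illustrative assumptions, and a real model would learn the queries and projections by training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def camera_to_bev(image_feats, bev_queries, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention from BEV queries to image tokens.

    image_feats: (N_img, C) flattened image feature tokens
    bev_queries: (H_bev * W_bev, C) one query vector per BEV cell
    Returns BEV-aligned features (H_bev * W_bev, D) and the attention weights.
    """
    q = bev_queries @ w_q                      # (N_bev, D)
    k = image_feats @ w_k                      # (N_img, D)
    v = image_feats @ w_v                      # (N_img, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores
    attn = softmax(scores, axis=-1)            # each BEV cell weights all image tokens
    return attn @ v, attn

# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
C, D = 32, 16
H_bev, W_bev = 8, 8
image_feats = rng.standard_normal((12 * 20, C))        # e.g. a 12x20 feature map
bev_queries = rng.standard_normal((H_bev * W_bev, C))  # would be learned in practice
w_q, w_k, w_v = (rng.standard_normal((C, D)) * 0.1 for _ in range(3))

bev_feats, attn = camera_to_bev(image_feats, bev_queries, w_q, w_k, w_v)
bev_map = bev_feats.reshape(H_bev, W_bev, D)           # features laid out on the BEV grid
```

Because the mapping from image tokens to BEV cells is carried by the attention weights rather than by a projection matrix, a module of this kind does not need camera intrinsics or extrinsics as input, which is consistent with the calibration-free claim in the abstract.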

Findings

01

Achieves approximately 87% distance-based accuracy on DeepSense 6G benchmark scenarios.

02

Outperforms the TransFuser baseline in multi-modal beam prediction.

03

Demonstrates the effectiveness of BEV-space fusion for sensing-assisted beam prediction.
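The distance-based accuracy in the first finding is not defined on this page. As a hedged illustration, the sketch below assumes it follows the DBA score used in the DeepSense 6G multi-modal beam prediction challenge: a top-3 metric that penalizes each prediction by its beam-index distance to the ground truth, normalized by a constant and clipped at 1. The function name `dba_score` and the defaults `max_k=3` and `delta=5.0` are assumptions in that sense, and the paper may use a different variant.

```python
import numpy as np

def dba_score(top_k_pred, y_true, max_k=3, delta=5.0):
    """Assumed distance-based accuracy over top-k beam predictions.

    top_k_pred: (N, >= max_k) predicted beam indices, best guess first
    y_true:     (N,) ground-truth beam indices
    """
    top_k_pred = np.asarray(top_k_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # Normalized beam-index distance, clipped so a far miss costs at most 1.
    dist = np.minimum(np.abs(top_k_pred[:, :max_k] - y_true[:, None]) / delta, 1.0)
    per_k = []
    for k in range(1, max_k + 1):
        best = dist[:, :k].min(axis=1)   # closest of the first k guesses
        per_k.append(1.0 - best.mean())  # accuracy when k guesses are allowed
    return float(np.mean(per_k))         # average over k = 1..max_k

# Toy example with made-up beam indices.
preds = [[10, 11, 12],
         [25, 21, 30]]
truth = [10, 22]
score = dba_score(preds, truth)
```

Under this definition a prediction that lands one or two beams away from the optimum still earns partial credit, which is why the metric is more forgiving than plain top-1 accuracy.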

Abstract

Beam prediction is critical for reducing beam-training overhead in millimeter-wave (mmWave) systems, especially in high-mobility vehicular scenarios. This paper presents a BEV-Fusion based framework that unifies camera, LiDAR, radar, and GPS modalities in a shared bird's-eye-view (BEV) representation for spatially consistent multi-modal fusion. Unlike prior approaches that fuse globally pooled one-dimensional features, the proposed method performs fusion in BEV space to preserve cross-modal geometric structure and visual semantic density. A learned camera-to-BEV module based on cross-attention is adopted to generate BEV-aligned visual features without relying on precise camera calibration, and a temporal transformer is used to aggregate five-step sequential observations for motion-aware beam prediction. Experiments on the DeepSense 6G benchmark show that BEV-Fusion achieves approximately…
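The contrast the abstract draws between BEV-space fusion and globally pooled one-dimensional features can be illustrated with a small NumPy sketch. It rasterizes two synthetic point sets, standing in for LiDAR and radar returns already expressed in a common ground-plane frame, onto the same grid and stacks them channel-wise, then shows what global pooling would keep. The helper `rasterize_to_bev`, the grid extents, and the random points are hypothetical and not taken from the paper.

```python
import numpy as np

def rasterize_to_bev(points_xy, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), grid=(64, 64)):
    """Count points per BEV cell; points_xy is (N, 2) in metres in a shared frame."""
    hist, _, _ = np.histogram2d(
        points_xy[:, 0], points_xy[:, 1],
        bins=list(grid), range=[list(x_range), list(y_range)],
    )
    return hist

rng = np.random.default_rng(0)
# Synthetic stand-ins for sensor returns (assumed, for illustration only).
lidar_xy = rng.uniform(low=[0.0, -20.0], high=[40.0, 20.0], size=(500, 2))
radar_xy = rng.uniform(low=[0.0, -20.0], high=[40.0, 20.0], size=(40, 2))

# BEV-space fusion: both modalities share one grid, so cell (i, j) refers to
# the same patch of ground in every channel.
bev = np.stack([rasterize_to_bev(lidar_xy), rasterize_to_bev(radar_xy)], axis=0)

# Global pooling: one number per modality; where the returns were is discarded.
pooled = bev.reshape(bev.shape[0], -1).mean(axis=1)
```

Here `bev` keeps a `(2, 64, 64)` layout that a convolutional or attention-based fusion stage can exploit, while `pooled` collapses to two scalars, which is the loss of geometric structure the abstract attributes to pooled one-dimensional fusion.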
