Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving

Mayank Mayank; Bharanidhar Duraisamy; Florian Geiß; Abhinav Valada

arXiv:2604.04797·cs.CV·April 7, 2026

TL;DR

This paper introduces MMF-BEV, a radar-camera bird's-eye-view (BEV) fusion framework that uses deformable attention for 3D object detection in autonomous driving. Evaluated on the View-of-Delft 4D radar dataset, the fused model outperforms its camera-only and radar-only counterparts.

Contribution

The paper presents a novel radar-camera fusion architecture with deformable attention modules, along with a training strategy and sensor contribution analysis for interpretability.

Findings

01

MMF-BEV outperforms unimodal baselines in 3D detection accuracy.

02

Sensor contribution analysis reveals effective modality weighting at different distances.

03

The proposed method achieves competitive results against prior fusion approaches.
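The per-distance modality weighting mentioned in the second finding can be illustrated with a small sketch. The summary does not say how the paper computes sensor contribution, so everything here is an assumption: the function name `per_distance_contribution`, the per-cell score maps `cam_w` and `radar_w`, the ego position at the bottom-centre of the BEV grid, the cell size, and the distance bins are all hypothetical.

```python
import numpy as np

def per_distance_contribution(cam_w, radar_w, cell_size=0.5,
                              bin_edges=(0, 10, 20, 30, 50)):
    """Average normalised camera/radar shares per distance bin.

    cam_w, radar_w: (H, W) non-negative per-cell modality scores
    (hypothetical inputs, e.g. attention mass per BEV cell).
    Assumption: the ego vehicle sits at the bottom-centre of the grid.
    """
    h, w = cam_w.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Metric distance of every cell centre from the assumed ego position.
    dx = (xs - (w - 1) / 2.0) * cell_size
    dy = (h - 1 - ys) * cell_size
    dist = np.hypot(dx, dy)

    # Normalise so the two shares sum to (almost exactly) one per cell.
    total = cam_w + radar_w + 1e-8
    cam_share = cam_w / total
    radar_share = radar_w / total

    rows = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (dist >= lo) & (dist < hi)
        if mask.any():  # skip bins that contain no BEV cells
            rows.append((lo, hi,
                         float(cam_share[mask].mean()),
                         float(radar_share[mask].mean())))
    return rows
```

Each returned row is `(bin_start, bin_end, camera_share, radar_share)`, which makes it easy to tabulate or plot how the balance between the two sensors shifts with range.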

Abstract

Accurate 3D object detection for autonomous driving requires complementary sensors. Cameras provide dense semantics but unreliable depth, while millimeter-wave radar offers precise range and velocity measurements with sparse geometry. We propose MMF-BEV, a radar-camera BEV fusion framework that leverages deformable attention for cross-modal feature alignment on the View-of-Delft (VoD) 4D radar dataset [1]. MMF-BEV builds a BEVDepth [2] camera branch and a RadarBEVNet [3] radar branch, each enhanced with Deformable Self-Attention, and fuses them via a Deformable Cross-Attention module. We evaluate three configurations: camera-only, radar-only, and hybrid fusion. A sensor contribution analysis quantifies per-distance modality weighting, providing interpretable evidence of sensor complementarity. A two-stage training strategy - pre-training the camera branch with depth supervision, then…
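To make the idea of deformable cross-attention between two BEV maps concrete, here is a minimal single-head NumPy sketch. It is not the MMF-BEV implementation: the function names, the choice of camera cells as queries over radar features, the weight matrices `w_off`, `w_attn`, and `w_val`, and the residual connection are all assumptions made for illustration. Each query cell predicts K sampling offsets and K attention weights, gathers radar features at the offset positions by bilinear interpolation, and adds their weighted sum to the camera feature.

```python
import numpy as np

def bilinear_sample(feat, xy):
    """Sample feat (H, W, C) at continuous (x, y) points of shape (..., 2)."""
    h, w, _ = feat.shape
    # Clamp sampling points to the grid (border padding).
    x = np.clip(xy[..., 0], 0, w - 1)
    y = np.clip(xy[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx = (x - x0)[..., None]
    fy = (y - y0)[..., None]
    top = feat[y0, x0] * (1 - fx) + feat[y0, x1] * fx
    bot = feat[y1, x0] * (1 - fx) + feat[y1, x1] * fx
    return top * (1 - fy) + bot * fy

def deformable_cross_attention(cam_bev, radar_bev, w_off, w_attn, w_val):
    """Camera BEV cells query the radar BEV map (illustrative sketch).

    cam_bev, radar_bev: (H, W, C) feature maps on the same BEV grid.
    w_off: (C, 2K) offset projection, w_attn: (C, K) weight projection,
    w_val: (C, C) value projection. All are hypothetical parameters.
    """
    h, w, _ = cam_bev.shape
    k = w_attn.shape[1]

    # Reference point of each query is its own cell, in (x, y) order.
    ys, xs = np.mgrid[0:h, 0:w]
    ref = np.stack([xs, ys], axis=-1).astype(float)        # (H, W, 2)

    # Offsets and attention logits are predicted from the query features.
    offsets = (cam_bev @ w_off).reshape(h, w, k, 2)        # (H, W, K, 2)
    logits = cam_bev @ w_attn                              # (H, W, K)
    logits = logits - logits.max(axis=-1, keepdims=True)   # stable softmax
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)

    # Project radar features to values and sample them at ref + offset.
    values = radar_bev @ w_val                             # (H, W, C)
    sampled = bilinear_sample(values, ref[:, :, None, :] + offsets)

    # Weighted sum over the K sampling points, plus an assumed residual.
    fused = (attn[..., None] * sampled).sum(axis=2)        # (H, W, C)
    return cam_bev + fused
```

Because each query attends to only K learned locations rather than every cell, the cost grows linearly with the number of BEV cells. That property is the usual motivation for deformable attention on dense grids.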
