# Radar–Camera Fusion in Perspective View and Bird’s Eye View for 3D Object Detection

**Authors:** Yuhao Xiao, Xiaoqing Chen, Yingkai Wang, Zhongliang Fu

PMC · DOI: 10.3390/s25196106 · 2025-10-03

## TL;DR

This paper introduces a radar-camera fusion method for 3D object detection that fuses the two modalities in both perspective view (PV) and bird's eye view (BEV), reaching 64.2 NDS and 56.3 mAP on the nuScenes dataset.

## Contribution

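A dual-view fusion paradigm: radar images generated from radar cross-section (RCS) and depth information are fused with camera features in perspective view through cross-modal attention to improve depth estimation, and the refined image BEV features are then fused with radar BEV features for 3D object detection.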
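
To make the cross-modal attention idea concrete, the following is a minimal sketch of scaled dot-product cross-attention in which camera PV features act as queries and radar PV features act as keys and values. It is not the paper's module: the function name `cross_attention_fuse`, the absence of learned projection layers, and the residual addition are all simplifying assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(camera_feats, radar_feats):
    """Hypothetical helper: fuse radar features into camera features.

    camera_feats: list of N query vectors (one per camera PV location).
    radar_feats:  list of M key/value vectors (one per radar PV location).
    Returns N vectors: camera feature plus attention-weighted radar feature.
    Assumption: no learned Q/K/V projections, which a real module would have.
    """
    dim = len(camera_feats[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in camera_feats:
        # Similarity of this camera location to every radar location.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in radar_feats]
        weights = softmax(scores)
        # Weighted sum of radar vectors (the values).
        attended = [
            sum(w * v[d] for w, v in zip(weights, radar_feats)) for d in range(dim)
        ]
        # Residual connection keeps the original camera feature.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


# Toy inputs: two camera locations, two radar locations, 2-D features.
camera = [[1.0, 0.0], [0.0, 1.0]]
radar = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention_fuse(camera, radar)
```

Because the attention weights are computed per camera location, each location draws more heavily on the radar features it resembles, which is the "dynamic" aspect of attention-based fusion.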

## Key findings

- The proposed method achieves state-of-the-art performance on the nuScenes dataset with 64.2 NDS and 56.3 mAP.
- Fusing radar and camera features in perspective view improves depth estimation, which in turn raises the precision of the image BEV features used in the subsequent BEV fusion.
- A radar image generation module and cross-modal fusion module are effective in combining radar and camera features.
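
For readers relating the two reported numbers, the nuScenes detection score (NDS) combines mAP with five true-positive (TP) error metrics as NDS = (5 · mAP + Σ(1 − min(1, mTP))) / 10. The sketch below applies that definition to the reported 64.2 NDS and 56.3 mAP, assuming both are percentages, to back out the implied average TP score. The individual TP errors are not given in this summary, so only their mean is derived; the function names are illustrative.

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score from mAP and the five mean TP errors."""
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


def implied_mean_tp_score(nds_value, m_ap, num_tp_metrics=5):
    """Invert the NDS formula to get the average TP score (1 - error)."""
    return (10.0 * nds_value - 5.0 * m_ap) / num_tp_metrics


# Reported values, converted from percent to fractions.
mean_tp = implied_mean_tp_score(0.642, 0.563)

# Sanity check: five identical errors with that mean score reproduce the NDS.
check = nds(0.563, [1.0 - mean_tp] * 5)
```

Under these assumptions the mean TP score comes out at about 0.721, meaning the TP error metrics average roughly 0.279 after clipping at 1.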

## Abstract

Three-dimensional object detection based on the fusion of millimeter-wave radar and cameras is increasingly gaining attention due to characteristics of low cost, high accuracy, and strong robustness. Recently, the bird’s eye view (BEV) fusion paradigm has dominated radar–camera fusion-based 3D object detection methods. In the BEV fusion paradigm, the detection accuracy is jointly determined by the precision of both image BEV features and radar BEV features. The precision of image BEV features is significantly influenced by depth estimation accuracy, whereas estimating depth from a monocular image is naturally a challenging, ill-posed problem. In this article, we propose a novel approach to enhance depth estimation accuracy by fusing camera perspective view (PV) features and radar perspective view features, thereby improving the precision of image BEV features. The refined image BEV features are then fused with radar BEV features to achieve more accurate 3D object detection results. To realize PV fusion, we designed a radar image generation module based on radar cross-section (RCS) and depth information, accurately projecting radar data into the camera view to generate radar images. The radar images are used to extract radar PV features. We present a cross-modal feature fusion module using the attention mechanism to dynamically fuse radar PV features with camera PV features. Comprehensive evaluations on the nuScenes 3D object detection dataset demonstrate that the proposed dual-view fusion paradigm outperforms the BEV fusion paradigm, achieving state-of-the-art performance with 64.2 NDS and 56.3 mAP.
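
The radar image generation step described in the abstract can be pictured with a simple pinhole projection. The sketch below is an assumption-laden illustration, not the authors' module: it takes radar points already expressed in the camera frame, projects them with hypothetical intrinsics (`fx`, `fy`, `cx`, `cy`), and writes depth and RCS into two image channels, keeping the nearest point when several land on the same pixel.

```python
def generate_radar_image(points, fx, fy, cx, cy, width, height):
    """Hypothetical helper: rasterize radar points into depth and RCS channels.

    points: iterable of (x, y, z, rcs) in the camera frame
            (x right, y down, z forward), an assumed convention.
    Returns (depth, rcs_img), each a height x width list of lists;
    0.0 marks pixels with no radar return.
    """
    depth = [[0.0] * width for _ in range(height)]
    rcs_img = [[0.0] * width for _ in range(height)]
    for x, y, z, rcs in points:
        if z <= 0.0:
            continue  # point is behind the camera
        # Pinhole projection to pixel coordinates.
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue  # projects outside the image
        # Keep the closest return per pixel.
        if depth[v][u] == 0.0 or z < depth[v][u]:
            depth[v][u] = z
            rcs_img[v][u] = rcs
    return depth, rcs_img


# Toy example: an 8 x 6 image and four radar points.
points = [
    (0.0, 0.0, 10.0, 5.0),   # lands at the principal point
    (0.0, 0.0, 20.0, 1.0),   # same pixel but farther, so it is ignored
    (1.0, 0.0, -5.0, 2.0),   # behind the camera
    (100.0, 0.0, 1.0, 3.0),  # outside the image
]
depth, rcs_img = generate_radar_image(
    points, fx=100.0, fy=100.0, cx=4.0, cy=3.0, width=8, height=6
)
```

The resulting two-channel image is sparse, since radar returns are few compared with camera pixels; a real pipeline would also need the radar-to-camera extrinsic transform before this step.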

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12526787/full.md

---
Source: https://tomesphere.com/paper/PMC12526787