RCDINO: Enhancing Radar-Camera 3D Object Detection with DINOv2 Semantic Features

Olga Matykina; Dmitry Yudin

arXiv:2508.15353·cs.CV·August 22, 2025

Open Access

TL;DR

RCDINO is a multimodal transformer-based model that enhances radar-camera 3D object detection by fusing visual backbone features with DINOv2 semantic features, achieving state-of-the-art results among radar-camera models on the nuScenes dataset.

Contribution

The paper introduces RCDINO, a fusion approach that enriches visual backbone features with semantic features from the pretrained DINOv2 foundation model, improving 3D detection while preserving compatibility with the baseline architecture.

Findings

01

Achieves 56.4 NDS and 48.1 mAP on nuScenes

02

Outperforms existing radar-camera detection models

03

Demonstrates effective multimodal feature fusion
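
For readers unfamiliar with the NDS figure quoted above, the following is a minimal sketch of how the nuScenes Detection Score combines mAP with the five mean true-positive errors (translation, scale, orientation, velocity, attribute). The function name and the error values in the example are hypothetical and chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def nds(mean_ap, tp_errors):
    """nuScenes Detection Score: (5 * mAP + sum(1 - min(1, err))) / 10."""
    if len(tp_errors) != 5:
        raise ValueError("expected the five mean TP errors")
    # Each error is clamped to 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# Hypothetical values for illustration only (not taken from the paper).
example_errors = {
    "mATE": 0.5,   # translation error
    "mASE": 0.25,  # scale error
    "mAOE": 0.5,   # orientation error
    "mAVE": 0.25,  # velocity error
    "mAAE": 0.25,  # attribute error
}
score = nds(0.5, example_errors)
```

Because mAP carries half of the total weight, a model can raise its NDS either by detecting more objects or by estimating the attributes of detected objects more accurately.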

Abstract

Three-dimensional object detection is essential for autonomous driving and robotics, relying on effective fusion of multimodal data from cameras and radar. This work proposes RCDINO, a multimodal transformer-based model that enhances visual backbone features by fusing them with semantically rich representations from the pretrained DINOv2 foundation model. This approach enriches visual representations and improves the model's detection performance while preserving compatibility with the baseline architecture. Experiments on the nuScenes dataset demonstrate that RCDINO achieves state-of-the-art performance among radar-camera models, with 56.4 NDS and 48.1 mAP. Our implementation is available at https://github.com/OlgaMatykina/RCDINO.
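
The abstract does not spell out the fusion operator, so the following is only a sketch of one plausible way to fuse backbone features with foundation-model features while keeping the output shape the baseline expects. It assumes channel concatenation followed by a 1x1 projection; the function names, tensor shapes, and random weights are all hypothetical, and the authors' actual design is in their repository.

```python
import numpy as np


def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbour resize of a (C, h, w) feature map to (C, out_h, out_w)."""
    _, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feat[:, rows[:, None], cols[None, :]]


def fuse(backbone_feat, semantic_feat, proj_weight):
    """Assumed fusion: concatenate channels, then project back to backbone width."""
    _, h, w = backbone_feat.shape
    semantic_up = resize_nearest(semantic_feat, h, w)
    stacked = np.concatenate([backbone_feat, semantic_up], axis=0)  # (C + D, H, W)
    # A 1x1 convolution is a per-pixel linear map over channels.
    return np.einsum("oc,chw->ohw", proj_weight, stacked)  # (C, H, W)


# Hypothetical shapes: 256 backbone channels, 384 semantic channels.
rng = np.random.default_rng(0)
backbone_feat = rng.standard_normal((256, 16, 44))
semantic_feat = rng.standard_normal((384, 8, 22))
proj_weight = rng.standard_normal((256, 256 + 384)) * 0.01
fused = fuse(backbone_feat, semantic_feat, proj_weight)
```

Projecting back to the backbone's channel count means the downstream detection layers see a tensor of the same shape as before, which is one way to read the abstract's claim of compatibility with the baseline architecture.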

Taxonomy

Topics: Advanced Neural Network Applications · Advanced SAR Imaging Techniques · Domain Adaptation and Few-Shot Learning