RCDINO: Enhancing Radar-Camera 3D Object Detection with DINOv2 Semantic Features
Olga Matykina, Dmitry Yudin

TL;DR
RCDINO is a multimodal transformer model that improves radar-camera 3D object detection by fusing visual backbone features with DINOv2 semantic features, achieving state-of-the-art results among radar-camera models on the nuScenes dataset.
Contribution
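The paper introduces RCDINO, a fusion approach that enriches visual backbone features with semantic representations from the pretrained DINOv2 foundation model, improving 3D detection while preserving compatibility with the baseline architecture.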
Findings
Achieves 56.4 NDS and 48.1 mAP on nuScenes
Outperforms existing radar-camera detection models
Demonstrates effective multimodal feature fusion
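For readers unfamiliar with the two reported metrics: on nuScenes, NDS (nuScenes Detection Score) combines mAP with five true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE) as NDS = (5 * mAP + sum of (1 - min(1, error))) / 10. The sketch below implements that formula and uses it to back out the average true-positive score implied by the reported numbers. The function names are illustrative and are not taken from the RCDINO repository; the individual error values are not given on this page, so only their implied mean is computed.

```python
def nds(mean_ap, tp_errors):
    """nuScenes Detection Score from mAP and the five mean TP errors.

    tp_errors: the five errors (mATE, mASE, mAOE, mAVE, mAAE), each >= 0.
    """
    if len(tp_errors) != 5:
        raise ValueError("expected exactly five true-positive error metrics")
    # Each error is clipped at 1 and converted to a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


def implied_mean_tp_score(nds_value, mean_ap):
    """Average TP score (1 - clipped error) implied by a given NDS and mAP."""
    return (10.0 * nds_value - 5.0 * mean_ap) / 5.0


# Reported results, expressed as fractions: 56.4 NDS and 48.1 mAP.
mean_score = implied_mean_tp_score(0.564, 0.481)
```

Because mAP carries half of the weight in NDS, a model can reach an NDS above its mAP only when its average true-positive score exceeds its mAP, which is the case for the numbers reported here.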
Abstract
Three-dimensional object detection is essential for autonomous driving and robotics, relying on effective fusion of multimodal data from cameras and radar. This work proposes RCDINO, a multimodal transformer-based model that enhances visual backbone features by fusing them with semantically rich representations from the pretrained DINOv2 foundation model. This approach enriches visual representations and improves the model's detection performance while preserving compatibility with the baseline architecture. Experiments on the nuScenes dataset demonstrate that RCDINO achieves state-of-the-art performance among radar-camera models, with 56.4 NDS and 48.1 mAP. Our implementation is available at https://github.com/OlgaMatykina/RCDINO.
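The abstract does not spell out how the backbone and DINOv2 features are combined, so the following is only a minimal sketch of one common way to do such a fusion, not the authors' implementation: resize the DINOv2 feature map to the backbone's spatial resolution, concatenate along channels, and project back to the backbone's channel count with a 1x1 convolution so that downstream layers see the same shape as before. All names, shapes, and the choice of nearest-neighbour resizing are assumptions made for illustration.

```python
import numpy as np


def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) feature map."""
    _, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feat[:, rows[:, None], cols[None, :]]


def fuse_features(backbone_feat, dino_feat, weight, bias):
    """Concatenate two feature maps and project back to the backbone width.

    backbone_feat: (C_b, H, W) features from the visual backbone.
    dino_feat:     (C_d, H', W') features from a frozen foundation model.
    weight:        (C_b, C_b + C_d) matrix, equivalent to a 1x1 convolution.
    bias:          (C_b,) bias vector.
    Returns a (C_b, H, W) map, the same shape as backbone_feat.
    """
    _, h, w = backbone_feat.shape
    dino_resized = resize_nearest(dino_feat, h, w)
    stacked = np.concatenate([backbone_feat, dino_resized], axis=0)
    return np.einsum("oc,chw->ohw", weight, stacked) + bias[:, None, None]


# Hypothetical shapes chosen only for the demonstration.
rng = np.random.default_rng(0)
backbone = rng.standard_normal((8, 4, 6))
dino = rng.standard_normal((12, 2, 3))
weight = rng.standard_normal((8, 20)) * 0.1
bias = np.zeros(8)
fused = fuse_features(backbone, dino, weight, bias)
```

Keeping the output shape identical to the backbone's is what would let such a module slot into an existing detector without changes to later stages, which matches the compatibility property the abstract describes.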
Taxonomy
Topics: Advanced Neural Network Applications · Advanced SAR Imaging Techniques · Domain Adaptation and Few-Shot Learning
