RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes

Michael Baltaxe; Dan Levi; Sagie Benaim

arXiv:2602.09532 · cs.CV · April 7, 2026

TL;DR

RAD is a retrieval-augmented approach to monocular metric depth estimation. It improves accuracy on underrepresented classes by retrieving RGB-D context samples with semantically similar content and fusing them with the input through a matched cross-attention module.

Contribution

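The paper presents a retrieval-augmented framework for monocular metric depth estimation on underrepresented classes. An uncertainty-aware retrieval step selects RGB-D context samples, and a dual-stream network with matched cross-attention transfers their geometric information to the input only at reliable point correspondences.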

Findings

01. RAD reduces relative absolute error by 29.2% on NYU Depth v2.

02. RAD outperforms state-of-the-art baselines on KITTI and Cityscapes.

03. RAD maintains competitive performance on standard benchmarks.
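
To make the headline metric concrete, the following is a minimal sketch of how an absolute relative error and a relative reduction such as the 29.2% figure are typically computed. It assumes that "relative absolute error" refers to the standard AbsRel metric, mean(|pred - gt| / gt) over valid pixels, and that per-class evaluation restricts the mean to a class mask. Both are assumptions; the function names abs_rel and relative_reduction are illustrative and not taken from the paper.

```python
import numpy as np


def abs_rel(pred, gt, mask=None):
    """Mean of |pred - gt| / gt over valid pixels.

    Assumption: pixels with gt <= 0 carry no ground truth and are skipped.
    `mask` is an optional boolean array (for example a class mask) that
    further restricts which pixels are evaluated.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    if mask is not None:
        valid &= np.asarray(mask, dtype=bool)
    if not valid.any():
        return float("nan")
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error
```

Under this reading, a reported reduction of 29.2% means relative_reduction(baseline, RAD) is about 0.292 for the chosen error metric; it is not a 29.2-point drop in the error itself.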

Abstract

Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by utilizing retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process both the input and retrieved context via a dual-stream network and fuse them using a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art…
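
The pipeline described in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the real system uses learned networks, while this sketch assumes a fixed uncertainty threshold, a mean-pooled feature descriptor, cosine-similarity retrieval over a descriptor bank, and a boolean correspondence matrix supplied from outside. All function names and parameters here are hypothetical.

```python
import numpy as np


def low_confidence_mask(uncertainty, threshold):
    """Mark pixels whose predicted uncertainty exceeds a threshold (assumed rule)."""
    return np.asarray(uncertainty) > threshold


def query_descriptor(features, mask):
    """Mean-pool (H, W, C) features over the masked region and L2-normalise."""
    if not mask.any():
        mask = np.ones_like(mask, dtype=bool)  # fall back to the whole image
    desc = features[mask].mean(axis=0)
    return desc / (np.linalg.norm(desc) + 1e-8)


def retrieve(query, bank, k):
    """Indices of the k bank descriptors (N, C) most similar to the query."""
    sims = bank @ query
    return np.argsort(-sims)[:k]


def matched_cross_attention(q_feats, ctx_feats, ctx_geom, match):
    """Cross-attention restricted to matched query/context pairs.

    q_feats: (Nq, C) query tokens, ctx_feats: (Nc, C) context tokens,
    ctx_geom: (Nc, D) geometric values carried by the context tokens,
    match: (Nq, Nc) boolean matrix of reliable correspondences.
    Query tokens without any match receive zeros, i.e. no transfer.
    """
    scores = q_feats @ ctx_feats.T / np.sqrt(q_feats.shape[1])
    scores = np.where(match, scores, -np.inf)  # block unmatched pairs
    has_match = match.any(axis=1)
    out = np.zeros((q_feats.shape[0], ctx_geom.shape[1]))
    s = scores[has_match]
    s = s - s.max(axis=1, keepdims=True)       # numerically stable softmax
    w = np.exp(s)
    w /= w.sum(axis=1, keepdims=True)
    out[has_match] = w @ ctx_geom
    return out
```

The masking step is the point of the sketch: geometry flows from a retrieved sample to the input only where a correspondence is marked reliable, and everything else is left to the input stream alone.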
