RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes
Michael Baltaxe, Dan Levi, Sagie Benaim

TL;DR
RAD is a retrieval-augmented approach to monocular metric depth estimation. It improves accuracy on underrepresented classes by retrieving semantically similar RGB-D context samples and fusing them with the input through cross-attention.
Contribution
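The paper presents a retrieval-augmented framework for monocular metric depth estimation that targets underrepresented classes: an uncertainty-aware retrieval step selects RGB-D context samples, and a dual-stream network fuses them with the input via matched cross-attention.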
Findings
RAD reduces relative absolute error by 29.2% on NYU Depth v2.
RAD outperforms state-of-the-art baselines on KITTI and Cityscapes.
RAD maintains competitive performance on standard benchmarks.
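For readers unfamiliar with the metric behind the first finding, the sketch below shows how a relative absolute error and a percentage reduction are commonly computed. It assumes the usual definition (the mean of |predicted - true| / true over pixels with valid ground truth); the summary does not spell out the exact evaluation protocol, and the function names and numbers here are illustrative, not taken from the paper.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels whose ground-truth depth is valid (> 0)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def relative_reduction(baseline_err, new_err):
    """Fraction by which new_err is lower than baseline_err."""
    return (baseline_err - new_err) / baseline_err


# Toy depths in metres; a ground-truth value of 0.0 marks an invalid pixel and is skipped.
gt = [2.0, 4.0, 0.0, 10.0]
pred = [2.2, 3.6, 5.0, 10.0]
print(abs_rel(pred, gt))

# Hypothetical error values, chosen only to show the arithmetic.
print(relative_reduction(0.10, 0.08))
```

A reported reduction of this kind is relative to the baseline's error, so it describes how much of the baseline error was removed rather than an absolute change in the metric.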
Abstract
Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by utilizing retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process both the input and retrieved context via a dual-stream network and fuse them using a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art…
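The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the thresholded uncertainty test, the cosine-similarity retrieval, the explicit list of matched token pairs, and every function and variable name are hypothetical stand-ins for components the abstract only names.

```python
import math


def low_confidence_indices(uncertainty, threshold):
    """Indices of regions whose uncertainty exceeds the threshold (assumed criterion)."""
    return [i for i, u in enumerate(uncertainty) if u > threshold]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve(query_emb, bank, k=1):
    """Return the k bank entries most similar to the query embedding."""
    ranked = sorted(bank, key=lambda e: cosine(query_emb, e["emb"]), reverse=True)
    return ranked[:k]


def matched_cross_attention(q_feats, ctx_feats, ctx_geo, matches):
    """Each query token i attends only to context tokens j with (i, j) in matches.

    Query tokens without any match receive a zero update, so no geometry
    is transferred where no correspondence is available.
    """
    scale = 1.0 / math.sqrt(len(q_feats[0]))
    geo_dim = len(ctx_geo[0])
    out = []
    for i, q in enumerate(q_feats):
        js = [j for (qi, j) in matches if qi == i]
        if not js:
            out.append([0.0] * geo_dim)
            continue
        logits = [scale * sum(a * b for a, b in zip(q, ctx_feats[j])) for j in js]
        top = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        out.append([
            sum(w / total * ctx_geo[j][c] for w, j in zip(weights, js))
            for c in range(geo_dim)
        ])
    return out


# Toy usage: region 1 is uncertain, so a context sample is retrieved for it.
uncertain = low_confidence_indices([0.1, 0.9, 0.4], threshold=0.5)
bank = [{"id": "a", "emb": [0.0, 1.0]}, {"id": "b", "emb": [1.0, 0.1]}]
neighbor = retrieve([1.0, 0.0], bank, k=1)[0]

q_feats = [[1.0, 0.0], [0.0, 1.0]]
ctx_feats = [[1.0, 0.0], [0.0, 1.0]]
ctx_geo = [[2.0], [5.0]]   # per-token geometric feature from the context stream
matches = [(0, 1)]         # query token 0 corresponds to context token 1
fused = matched_cross_attention(q_feats, ctx_feats, ctx_geo, matches)
print(uncertain, neighbor["id"], fused)
```

Restricting attention to matched pairs is what distinguishes this from ordinary cross-attention: an unmatched query token cannot pull geometry from an unrelated part of the retrieved sample. How RAD actually establishes and filters correspondences is not stated in the text above.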
