Language as Prior, Vision as Calibration: Metric Scale Recovery for Monocular Depth Estimation
Mingxia Zhan, Li Zhang, Beibei Wang, Yingjie Wang, Zenglin Shi

TL;DR
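This paper recovers metric depth from a single image by calibrating a frozen relative-depth model with an image-specific affine transform in inverse depth. Language cues bound the feasible calibration parameters, and visual features select the final calibration within those bounds, improving accuracy and robustness across datasets.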
Contribution
It proposes a novel calibration approach using language-based uncertainty envelopes and frozen visual features, enabling effective metric depth estimation without retraining the backbone.
Findings
Improves in-domain depth estimation accuracy on NYUv2 and KITTI.
Enhances zero-shot transfer robustness to SUN-RGBD and DDAD.
Uses language cues to bound calibration parameters, reducing domain shift sensitivity.
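The idea of language bounding the calibration while vision picks a point inside the bounds can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the centre/half-width form of the envelope, the `tanh` squashing, and the `exp` mapping from unconstrained scale to positive scale are all assumptions made for the example.

```python
import numpy as np

def select_within_envelope(center, half_width, visual_logits):
    """Pick calibration parameters inside a language-predicted envelope.

    center, half_width: shape (2,), envelope in unconstrained space
                        (hypothetical output of a text head).
    visual_logits:      shape (2,), any real values
                        (hypothetical output of a vision head).
    """
    center = np.asarray(center, dtype=np.float64)
    half_width = np.abs(np.asarray(half_width, dtype=np.float64))
    # tanh squashes the visual logits into [-1, 1], so the result can
    # never leave [center - half_width, center + half_width].
    offset = np.tanh(np.asarray(visual_logits, dtype=np.float64))
    return center + half_width * offset

def to_scale_shift(u):
    """Map unconstrained (u_s, u_t) to (s, t) with s > 0 (assumed parameterisation)."""
    return float(np.exp(u[0])), float(u[1])

# Example: a wide envelope on scale, a narrow one on shift.
u = select_within_envelope(center=[0.0, 0.05], half_width=[1.0, 0.02],
                           visual_logits=[0.3, -2.0])
s, t = to_scale_shift(u)
```

With this construction, a confident caption would correspond to a small half-width (vision can only fine-tune), while a vague caption would yield a wide envelope that leaves most of the decision to the visual features.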
Abstract
Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth and train only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Since captions provide coarse but noisy scale cues that vary with phrasing and missing objects, we use language to predict an uncertainty-aware envelope that bounds feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. We then use pooled multi-scale frozen visual features to select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image…
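The closed-form least-squares oracle mentioned in the abstract can be illustrated with a short sketch. It assumes the backbone outputs a relative inverse-depth map and fits a scale `s` and shift `t` so that `s * rel + t` approximates `1 / depth` on valid ground-truth pixels. The function names and the masking convention are hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def inverse_depth_oracle(rel_inv_depth, gt_depth, valid=None, eps=1e-6):
    """Least-squares (s, t) such that s * rel + t ~= 1 / gt_depth."""
    rel = np.asarray(rel_inv_depth, dtype=np.float64).ravel()
    gt = np.asarray(gt_depth, dtype=np.float64).ravel()
    mask = gt > eps  # ignore pixels without usable ground truth
    if valid is not None:
        mask &= np.asarray(valid, dtype=bool).ravel()
    x = rel[mask]
    y = 1.0 / gt[mask]
    # Solve min over (s, t) of || [x, 1] @ [s, t]^T - y ||^2.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(s), float(t)

def apply_calibration(rel_inv_depth, s, t, eps=1e-6):
    """Turn relative inverse depth into metric depth with the affine map."""
    inv = s * np.asarray(rel_inv_depth, dtype=np.float64) + t
    return 1.0 / np.clip(inv, eps, None)  # clip avoids division by zero

# Synthetic check: build ground truth from a known (s, t) and recover it.
rng = np.random.default_rng(0)
rel = rng.uniform(0.1, 1.0, size=(4, 5))
gt = 1.0 / (0.5 * rel + 0.1)
s_hat, t_hat = inverse_depth_oracle(rel, gt)
metric = apply_calibration(rel, s_hat, t_hat)
```

Because the fit needs ground-truth depth, a target of this kind is only available during training; at test time the calibration has to be predicted from the image and caption.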
Taxonomy
Topics: Advanced Vision and Imaging · Optical measurement and interference techniques · Advanced Image Processing Techniques
