Focusable Monocular Depth Estimation

Yuxin Du; Tao Lin; Zile Zhong; Runting Li; Xiyao Chen; Jiting Liu; Chenglin Liu; Ying-Cong Chen; Yuqian Fu; and Bo Zhao

arXiv:2605.11756 · cs.CV · May 13, 2026

TL;DR

This paper introduces Focusable Monocular Depth Estimation (FDE), a task in which depth models prioritize user-specified target regions, together with FocusDepth, a framework for that task that improves accuracy at boundaries and in foreground regions while maintaining scene coherence.

Contribution

The paper proposes FocusDepth, a monocular relative depth estimation framework conditioned on box/text prompts and built around Multi-Scale Spatial-Aligned Fusion (MSSA), and establishes FDE-Bench, a new benchmark for target-centric depth estimation.

Findings

01. FocusDepth outperforms baseline models on FDE-Bench in target regions.

02. MSSA's spatial alignment is crucial for prompt-guided depth accuracy.

03. FocusDepth achieves significant improvements in boundary and foreground regions.

Abstract

Monocular depth foundation models generalize well across scenes, yet they are typically optimized with uniform pixel-wise objectives that do not distinguish user-specified or task-relevant target regions from the surrounding context. We therefore introduce Focusable Monocular Depth Estimation (FDE), a region-aware depth estimation task in which, given a specified target region, the model is required to prioritize foreground depth accuracy, preserve sharp boundary transitions, and maintain coherent global scene geometry. To prioritize task-critical region modeling, we propose FocusDepth, a prompt-conditioned monocular relative depth estimation framework that guides depth modeling to focus on target regions via box/text prompts. The core Multi-Scale Spatial-Aligned Fusion (MSSA) in FocusDepth spatially aligns multi-scale features from Segment Anything Model 3 to the Depth Anything family…
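To make the contrast between a uniform pixel-wise score and a target-centric one concrete, here is a minimal sketch of a region-split error report. It is an illustration only, not the FDE-Bench protocol or the FocusDepth training objective: the function names (`abs_rel`, `box_mask`, `boundary_mask`, `focus_report`), the box format, the `band` width, and the choice of absolute relative error are all assumptions made for this example. It also assumes the prediction has already been aligned to the ground-truth scale, which a relative depth model would need in practice.

```python
from typing import Dict, List, Tuple

Depth = List[List[float]]
Mask = List[List[bool]]
Box = Tuple[int, int, int, int]  # hypothetical format: (row0, col0, row1, col1), end-exclusive


def abs_rel(pred: Depth, gt: Depth, mask: Mask) -> float:
    """Mean |pred - gt| / gt over the pixels where mask is True."""
    total, count = 0.0, 0
    for r, row in enumerate(mask):
        for c, keep in enumerate(row):
            if keep:
                total += abs(pred[r][c] - gt[r][c]) / gt[r][c]
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return total / count


def box_mask(height: int, width: int, box: Box) -> Mask:
    """True inside the box, False elsewhere; out-of-image box parts are ignored."""
    r0, c0, r1, c1 = box
    return [[r0 <= r < r1 and c0 <= c < c1 for c in range(width)]
            for r in range(height)]


def boundary_mask(height: int, width: int, box: Box, band: int = 1) -> Mask:
    """Pixels within `band` pixels of the box edge, on either side of it."""
    r0, c0, r1, c1 = box
    outer = box_mask(height, width, (r0 - band, c0 - band, r1 + band, c1 + band))
    inner = box_mask(height, width, (r0 + band, c0 + band, r1 - band, c1 - band))
    return [[outer[r][c] and not inner[r][c] for c in range(width)]
            for r in range(height)]


def focus_report(pred: Depth, gt: Depth, box: Box, band: int = 1) -> Dict[str, float]:
    """Report the same error over the whole image, the target box, and its boundary band."""
    h, w = len(gt), len(gt[0])
    full = [[True] * w for _ in range(h)]
    return {
        "global": abs_rel(pred, gt, full),
        "target": abs_rel(pred, gt, box_mask(h, w, box)),
        "boundary": abs_rel(pred, gt, boundary_mask(h, w, box, band)),
    }
```

On a scene where the prediction is perfect in the background but wrong inside a small target box, the `global` entry stays low because the target covers few pixels, while the `target` entry exposes the error. That is the gap a uniform objective hides and a region-aware task is meant to surface.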
