Depth Anything with Any Prior
Zehan Wang, Siyu Chen, Lihe Yang, Jialei Wang, Ziang Zhang, Hengshuang Zhao, Zhou Zhao

TL;DR
This paper introduces Prior Depth Anything, a framework that combines metric priors and relative depth predictions through a coarse-to-fine pipeline, achieving accurate, dense depth maps with strong zero-shot generalization across multiple tasks.
Contribution
It proposes a novel integration method for metric priors and depth predictions, including pixel-level alignment and a conditioned MDE model, enabling flexible, high-quality depth estimation across diverse scenarios.
Findings
Effective zero-shot generalization across multiple datasets
Outperforms previous task-specific depth methods
Enables test-time model switching for accuracy-efficiency trade-offs
Abstract
This work presents Prior Depth Anything, a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction, generating accurate, dense, and detailed metric depth maps for any scene. To this end, we design a coarse-to-fine pipeline to progressively integrate the two complementary depth sources. First, we introduce pixel-level metric alignment and distance-aware weighting to pre-fill diverse metric priors by explicitly using depth prediction. It effectively narrows the domain gap between prior patterns, enhancing generalization across varying scenarios. Second, we develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors. By conditioning on the normalized pre-filled prior and prediction, the model further implicitly merges the two complementary depth…
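The pre-fill step described above can be illustrated with a minimal sketch. It assumes that each missing pixel takes its k nearest measured pixels, fits a scale and shift that map the relative prediction onto the metric values by weighted least squares, and applies that fit to its own predicted depth. The function name `prefill_prior`, the inverse-distance weights, and the default `k=5` are assumptions made for illustration. The source does not specify the paper's exact weighting function or neighborhood size, so this is not the authors' implementation.

```python
import numpy as np

def prefill_prior(pred, prior, valid, k=5, eps=1e-6):
    """Fill missing metric depth with a per-pixel scale-and-shift fit.

    pred  : (H, W) dense relative depth prediction
    prior : (H, W) metric depth, meaningful only where `valid` is True
    valid : (H, W) boolean mask of pixels that carry a measurement
    k, eps: hypothetical defaults chosen for this sketch
    """
    ys, xs = np.nonzero(valid)
    if len(ys) == 0:
        raise ValueError("at least one measured pixel is required")
    known_xy = np.stack([ys, xs], axis=1).astype(np.float64)
    known_pred = pred[ys, xs].astype(np.float64)
    known_metric = prior[ys, xs].astype(np.float64)
    k = min(k, len(ys))

    filled = prior.astype(np.float64).copy()
    for y, x in zip(*np.nonzero(~valid)):
        # Brute-force search for the k nearest measured pixels.
        dist = np.linalg.norm(known_xy - np.array([y, x]), axis=1)
        idx = np.argpartition(dist, k - 1)[:k]
        # Assumed distance-aware weighting: closer measurements count more.
        sqrt_w = np.sqrt(1.0 / (dist[idx] + eps))
        # Weighted least squares for metric ~= scale * pred + shift.
        design = np.stack([known_pred[idx], np.ones(k)], axis=1)
        (scale, shift), *_ = np.linalg.lstsq(
            design * sqrt_w[:, None], known_metric[idx] * sqrt_w, rcond=None
        )
        filled[y, x] = scale * pred[y, x] + shift
    return filled

# Synthetic check: the metric depth is an exact affine map of the prediction,
# and only every other pixel in each direction carries a measurement.
rng = np.random.default_rng(0)
demo_pred = rng.random((8, 8))
demo_metric = 2.0 * demo_pred + 3.0
demo_valid = np.zeros((8, 8), dtype=bool)
demo_valid[::2, ::2] = True
demo_filled = prefill_prior(demo_pred, np.where(demo_valid, demo_metric, 0.0), demo_valid)
print("max abs error:", np.abs(demo_filled - demo_metric).max())
```

The per-pixel loop with a brute-force neighbor search keeps the sketch short but scales poorly with image size. A spatial index or a batched solve would be the natural next step if latency matters.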
Peer Reviews
Decision: ICLR 2026 Poster
- Extensive experiments show that the method performs strongly across different sources of depth priors, which is a key strength of the model compared to existing works. The proposed method convincingly generalizes better than existing state-of-the-art approaches.
- The pixel-level alignment method for in-painting missing depth regions seems novel, and is shown to be significantly better than naive interpolation in Tables 6 and 9. There is reason to
- The method first requires a heuristic-based approach for densifying a depth prior. This might not be effective or practical, especially when the prior map is extremely sparse, from both a latency standpoint (densification requires solving a least-squares regression for each missing pixel) and a performance standpoint (the nearest neighbors might be extremely far from the query pixel).
- Minor: I am not sure whether it is an issue with my PDF viewer, but the formatting of the paper seems off
1. The paper presents extensive experiments that effectively validate the proposed method and justify its performance against benchmark approaches.
2. The proposed framework achieves enhanced depth estimation quality through a simple and well-structured pipeline.
1. The proposed method shows limited novelty, as it primarily predicts per-pixel scale and shift values in MDE to align with the depth prior.
2. The paper states, “We highlight best and second-best results” in the quantitative results section; however, only a few columns in Tables 2–5 are correctly annotated. The remaining columns contain inconsistencies, including missing annotations, unclear labeling, or incorrect markings.
3. The reported quantitative and qualitative results do not convincing
- Broad applicability and novelty: One framework handles completion, upsampling, inpainting, and their combinations, covering common real-world inputs that previous methods often treat separately.
- Strong empirical performance: Consistent zero-shot results across multiple datasets and tasks, often matching or surpassing task-specific baselines without per-task fine-tuning.
- Robust to mixed priors: Maintains accuracy when prior types co-occur (e.g., sparse + low-res + holes), a challenging but prac
- Efficiency underreported: The two-stage design (predictor + kNN-style alignment + conditioned refiner) lacks detailed latency, memory, and component-wise cost, leaving deployability unclear.
- Ablations and sensitivity limited: The impact of weaker/faster predictors, distance-aware weighting, neighborhood size, or removing coarse alignment is not fully quantified.
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Advanced Neural Network Applications
Methods: Inpainting
