AnyDepth: Depth Estimation Made Easy
Zeyu Ren, Zeyu Zhang, Wukai Li, Qingxiang Liu, Hao Tang

TL;DR
AnyDepth introduces a lightweight, data-centric framework for zero-shot monocular depth estimation that combines a high-quality visual encoder with a compact transformer decoder, achieving high accuracy with fewer parameters.
Contribution
The paper presents a novel simple depth transformer (SDT) decoder and a quality-based filtering strategy, significantly reducing model complexity while improving accuracy in zero-shot depth estimation.
Findings
Outperforms DPT in accuracy across five benchmarks.
Reduces model parameters by approximately 85%-89%.
Enhances training quality by filtering harmful samples.
Abstract
Monocular depth estimation aims to recover the depth information of 3D scenes from 2D images. Recent work has made significant progress, but its reliance on large-scale datasets and complex decoders has limited its efficiency and generalization ability. In this paper, we propose a lightweight and data-centric framework for zero-shot monocular depth estimation. We first adopt DINOv3 as the visual encoder to obtain high-quality dense features. Secondly, to address the inherent drawbacks of the complex structure of the DPT, we design the Simple Depth Transformer (SDT), a compact transformer-based decoder. Compared to the DPT, it uses a single-path feature fusion and upsampling process to reduce the computational overhead of cross-scale feature fusion, achieving higher accuracy while reducing the number of parameters by approximately 85%-89%. Furthermore, we propose a quality-based…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Lack of comparison to the state of the art. The authors compare their proposed method with DPT and Depth Anything v2, but these are not the strongest state-of-the-art works, and there are better methods that the authors need to compare against. [1] Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation [2] Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation [3] Depth Pro: Sharp monocular metric depth in less tha
1. Limited comparison with state-of-the-art methods The authors compare their proposed approach only with DPT and Depth Anything v2, which, while relevant, do not represent the current state-of-the-art in monocular depth estimation. To strengthen the empirical validation, comparisons should be made with more recent and competitive methods, such as: [1] Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation [2] Metric3D v2: A Versatile Monocular Geometric Foundation Model fo
The proposed Depth Distribution Score and Gradient Continuity Score are interesting metrics for assessing depth-map quality, but some aspects remain limited.
(1) The metrics prioritize distribution smoothness over true geometric fidelity. Consequently, a high score does not necessarily correlate with an accurate depth map. (2) The selection of hyperparameters—such as the number of bins, weight values, and normalization schemes—appears empirical. The approach lacks theoretical justification or a sensitivity analysis to validate these choices. (3) The evaluation fails to account for semantic information, which can lead to the unfair penalization of s
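The review above refers to histogram bins, weight values, and normalization schemes, but this page does not give the paper's formulas. The following is only a minimal sketch of what a quality-based filter of this kind might look like. Everything in it is an assumption made for illustration: the entropy-based distribution score, the neighbor-difference continuity score, the function names, the equal weights, the bin count, and the threshold are hypothetical and are not the authors' definitions.

```python
import math


def depth_distribution_score(depth, num_bins=16):
    """Hypothetical score: normalized entropy of the depth histogram, in [0, 1]."""
    values = [v for row in depth for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # constant map: all mass falls in a single bin
    counts = [0] * num_bins
    for v in values:
        # Min-max normalize, then clamp the top value into the last bin.
        idx = min(int((v - lo) / (hi - lo) * num_bins), num_bins - 1)
        counts[idx] += 1
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts if c > 0)
    return entropy / math.log(num_bins)


def gradient_continuity_score(depth):
    """Hypothetical score: 1 / (1 + mean absolute neighbor difference)."""
    values = [v for row in depth for v in row]
    scale = (max(values) - min(values)) or 1.0  # avoid division by zero
    diffs = []
    for r, row in enumerate(depth):
        for c, v in enumerate(row):
            if c + 1 < len(row):  # horizontal neighbor
                diffs.append(abs(row[c + 1] - v) / scale)
            if r + 1 < len(depth):  # vertical neighbor
                diffs.append(abs(depth[r + 1][c] - v) / scale)
    if not diffs:
        return 1.0
    return 1.0 / (1.0 + sum(diffs) / len(diffs))


def quality_score(depth, w_dist=0.5, w_grad=0.5, num_bins=16):
    """Weighted sum of the two scores; the weights are assumed, not from the paper."""
    return (w_dist * depth_distribution_score(depth, num_bins)
            + w_grad * gradient_continuity_score(depth))


def filter_samples(depth_maps, threshold=0.5):
    """Keep only depth maps whose combined score reaches the threshold."""
    return [d for d in depth_maps if quality_score(d) >= threshold]


# Usage: a smooth ramp spreads over many bins; a flat map collapses into one.
ramp = [[float(r + c) for c in range(8)] for r in range(8)]
flat = [[1.0] * 8 for _ in range(8)]
kept = filter_samples([ramp, flat], threshold=0.6)
```

In this sketch both scores depend only on the statistics of a single depth map, which illustrates the reviewer's point (1): a map can score well without being geometrically accurate. The bin count and weights are free parameters, which is what the requested sensitivity analysis would examine.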
The motivation to reduce data requirements of depth estimation and at the same time reduce the parameter count of these models is meaningful and interesting. The paper shows clear efficiency gains over prior methods while also maintaining/improving overall performance. The idea is straightforward and the paper is well written. Figures 2 and 3 immediately illustrate the advantages of the method. It can be considered as the first application of DINOv3 for zero-shot depth estimation.
There is limited novelty within the data-centric learning metrics. The depth distribution score and gradient continuity score are simple to come up with and lack theoretical justification. In the experiments involving these metrics, there is no sensitivity analysis or comparison to other data sampling strategies, e.g. uncertainty-aware filtering with techniques such as [1]. The effectiveness of data filtering is incremental. Table 3 shows only a minor benefit from applying the introduc
- The paper is well written and easy for me to follow. - Having a more lightweight and effective decoder compared with DPT would be very helpful for the depth community. - I like the idea of using less data and trying to achieve comparable performance. It's well motivated.
- It's a bit of an overclaim to regard adopting DINOv3 as a major contribution. - The SDT head should be ablated more carefully to demonstrate that it is better than the DPT head. Currently, it is only shown to be better than the DPT head in a frozen DINOv3 setting. But in most cases, people don't freeze the encoder during training. Will the SDT head be better than the DPT head when fine-tuning the DINOv3 encoder as well? On the other hand, it would be necessary to use various encoders in experiments and prove that SDT head can
Taxonomy
Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Advanced Neural Network Applications
