BenchDepth: Are We on the Right Way to Evaluate Depth Foundation Models?

Zhenyu Li; Haotong Lin; Jiashi Feng; Peter Wonka; Bingyi Kang

arXiv:2507.15321 · cs.CV · July 22, 2025

4 Reviews

TL;DR

BenchDepth introduces a new evaluation benchmark for depth foundation models that focuses on their practical utility across multiple real-world tasks, addressing biases in traditional metrics.

Contribution

The paper proposes BenchDepth, a comprehensive benchmark evaluating depth models through downstream tasks, avoiding alignment biases and promoting fairer, application-oriented assessments.

Findings

01. Current benchmarks have biases favoring certain depth representations.

02. DFMs show varied performance across different downstream tasks.

03. BenchDepth provides a more practical evaluation framework for depth models.

Abstract

Depth estimation is a fundamental task in computer vision with diverse applications. Recent advancements in deep learning have led to powerful depth foundation models (DFMs), yet their evaluation remains challenging due to inconsistencies in existing protocols. Traditional benchmarks rely on alignment-based metrics that introduce biases, favor certain depth representations, and complicate fair comparisons. In this work, we propose BenchDepth, a new benchmark that evaluates DFMs through five carefully selected downstream proxy tasks: depth completion, stereo matching, monocular feed-forward 3D scene reconstruction, SLAM, and vision-language spatial understanding. Unlike conventional evaluation protocols, our approach assesses DFMs based on their practical utility in real-world applications, bypassing problematic alignment procedures. We benchmark eight state-of-the-art DFMs and provide…
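To make "alignment-based metrics" concrete, here is a minimal sketch of one common form of the procedure: a least-squares scale and shift is fitted between prediction and ground truth before AbsRel and δ1 are computed. The function names, the toy data, and the choice to fit in depth space (rather than disparity space) are assumptions made for illustration; none of this is taken from the BenchDepth code.

```python
import numpy as np

def align_scale_shift(pred, gt, mask):
    """Fit s, t by least squares so that s * pred + t approximates gt on valid pixels.

    Assumption: the fit is done directly in depth space. Protocols differ on
    this choice, which is one way alignment can favor a representation.
    """
    p = pred[mask].astype(np.float64)
    g = gt[mask].astype(np.float64)
    A = np.stack([p, np.ones_like(p)], axis=1)  # columns: prediction, constant
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t

def abs_rel(pred, gt, mask):
    """Mean absolute relative error over valid pixels (lower is better)."""
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def delta1(pred, gt, mask, thresh=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < thresh (higher is better).

    Assumes strictly positive depths in both pred and gt.
    """
    ratio = np.maximum(pred[mask] / gt[mask], gt[mask] / pred[mask])
    return float(np.mean(ratio < thresh))

# Toy data (not from the paper): the prediction is an affine distortion of gt.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(4, 4))
pred = (gt - 0.5) / 2.0
mask = gt > 0

aligned = align_scale_shift(pred, gt, mask)
raw_error = abs_rel(pred, gt, mask)         # large: scale and shift are wrong
aligned_error = abs_rel(aligned, gt, mask)  # near zero: alignment absorbs them
aligned_delta1 = delta1(aligned, gt, mask)
```

In this toy case the raw prediction scores poorly while the aligned one is nearly perfect, which shows how much of the reported number can depend on the alignment step rather than on the model output itself.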

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The work addresses a well-recognized gap: depth metrics (e.g., AbsRel, RMSE, δ thresholds) do not reliably predict whether a depth model is useful when integrated into 3D or geometric pipelines. Positioning evaluation around actual deployment tasks is well justified.
2. Diverse and well-chosen downstream tasks. The inclusion of tasks spanning low-level to high-level reasoning adds credibility to the claim that the benchmark captures general utility rather than performance tuned to a particul

Weaknesses

1. Interpretation of the reported negative correlation between traditional metrics and downstream rankings requires clarification. The paper claims that standard depth metrics negatively correlate with downstream task performance. Taken at face value, this would imply that models performing worse under standard metrics could perform better in real applications, which is conceptually difficult to justify.
2. The benchmark only uses one baseline architecture per task. Performance may depend on m
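The correlation this reviewer refers to is between model rankings under different metrics. As a rough illustration of how such a rank correlation can be computed, and of why its sign depends on metric direction, here is a minimal Spearman sketch. The helper names and all numbers are made-up toy values, not results from the paper, and the ranking helper assumes no ties.

```python
import numpy as np

def ranks(values):
    """Rank from 1 (smallest) to n (largest). Assumes there are no ties."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values)
    r = np.empty(len(values), dtype=np.float64)
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

# Hypothetical scores for four models (illustrative only).
toy_absrel = [0.05, 0.08, 0.11, 0.20]      # lower is better
toy_downstream = [0.90, 0.80, 0.70, 0.50]  # higher is better

# The two metrics agree on the model ordering, yet the raw coefficient is -1
# because the metrics point in opposite directions.
raw_rho = spearman(toy_absrel, toy_downstream)

# Negating the error metric orients both so that higher is better.
oriented_rho = spearman([-x for x in toy_absrel], toy_downstream)
```

Under these toy values the raw coefficient is negative even though both metrics rank the models identically, so a reported negative correlation is only interpretable once the direction of each metric is stated.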

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The paper shifts the benchmark focus from geometry-based metrics to application-driven performance, which brings a new insight for the community.
2. The benchmark covers both low-level and high-level depth tasks, which provides a multi-layered perspective on depth model utility.
3. The paper presents detailed quantitative analysis and interesting observations, such as strong internal consistency between proxy tasks but weak correlation with traditional benchmarks.
4. The multiple aspects of

Weaknesses

1. Depth estimation is fundamentally and inherently a metric prediction problem (unlike the LLMs and VLMs mentioned in lines 49–52, whose evaluation relies on plausibility or relevance rather than quantitative accuracy). Thus, any evaluation suggesting the opposite conclusion must explain why geometric fidelity loses its predictive power. The authors fail to provide a compelling theoretical basis for this discrepancy, leaving open the possibility that the issue comes from their task integration rathe

Reviewer 03 · Rating 4 · Confidence 4

Strengths

Improved benchmarking in monocular reconstruction is very welcome. The use of correlation to identify a coherent subset of tasks that make a benchmark is sensible. Generally, I appreciate the design decisions, and I think this will make a valuable contribution to the literature when it is finished.

Weaknesses

The coherence of the benchmark depends on the correlation analysis highlighted in Figure 1, and this is incomplete. Task 5 in the text (VLM Spatial Understanding) is excluded from the figure, and the justification for excluding it doesn't really make sense. While the measure "Pos" shows little variation between the models, many of the other metrics show extremely large shifts. Tasks 5, 6, and 7 in Figure 1 (average, traditional delta, and traditional AbsRel) are not reported in the text. In additi

Reviewer 04 · Rating 4 · Confidence 4

Strengths

1. The paper targets an important and timely question regarding how to evaluate large-scale depth foundation models. Through experimental results, the paper identifies an evaluation gap between geometric accuracy and downstream performance on different tasks, which is an interesting observation.
2. The authors perform a large and comprehensive experimental study, covering multiple downstream tasks and eight modern DFMs. The empirical effort is substantial, and the collected results could serve

Weaknesses

1. The core research question is posed in the title (“Are we on the right way to evaluate DFMs?”). However, after carefully reading the paper, I find that this question is never brought up, never addressed, never answered, and never discussed in depth in the paper. Sure, the observations made by the paper (a mismatch between current evaluation protocols and downstream performance) are valuable to this question, yet the discussion ends right there without much high-lev
