Unlocking Dense Metric Depth Estimation in VLMs

Hanxun Yu; Xuan Qu; Yuxin Wang; Jianke Zhu; Lei Ke

arXiv:2605.15876·cs.CV·May 21, 2026

2 repositories · 1 model · 1 dataset

TL;DR

DepthVLM turns an existing vision-language model into a dense 3D geometry predictor by attaching a lightweight depth head to its LLM backbone, so that a single forward pass produces a full-resolution depth map alongside the language output.

Contribution

The paper introduces DepthVLM, a framework that adds dense metric depth prediction to a VLM while preserving its multimodal capability.

Findings

01

DepthVLM outperforms existing VLMs in depth estimation accuracy.

02

It achieves higher inference efficiency than prior methods.

03

It improves complex 3D spatial reasoning in multimodal tasks.
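For readers unfamiliar with how depth estimation accuracy is usually scored, here is a minimal sketch of two metrics that are common in the monocular depth literature: absolute relative error (AbsRel) and the δ < 1.25 threshold accuracy. This summary does not say which metrics the paper reports, so treat the choice of metrics, the function name `depth_metrics`, and the `min_depth` cutoff as assumptions made for illustration.

```python
import numpy as np

def depth_metrics(pred, gt, min_depth=1e-3):
    """Score a predicted metric depth map against ground truth.

    Hypothetical helper for illustration; not taken from the paper.
    pred, gt: arrays of the same shape, depths in the same unit (e.g. metres).
    Predictions are assumed to be strictly positive.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # Ignore pixels without a valid ground-truth depth.
    valid = gt > min_depth
    p, g = pred[valid], gt[valid]

    # AbsRel: mean of |pred - gt| / gt over valid pixels (lower is better).
    abs_rel = np.mean(np.abs(p - g) / g)

    # delta1: fraction of pixels with max(pred/gt, gt/pred) < 1.25 (higher is better).
    ratio = np.maximum(p / g, g / p)
    delta1 = np.mean(ratio < 1.25)

    return {"abs_rel": float(abs_rel), "delta1": float(delta1)}
```

Because both metrics compare absolute values rather than values aligned by scale and shift, they penalize a model that gets the relative ordering right but the metric scale wrong.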

Abstract

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We…
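The idea of a lightweight depth head that turns per-token hidden states into a full-resolution depth map can be illustrated with a small NumPy sketch. It is not the authors' architecture: the abstract does not describe the head's layers, so the single linear projection, the log-depth parameterization, the patch-wise rearrangement, and every name here (`depth_head`, `visual_tokens`, `grid_hw`, `patch`) are assumptions chosen to show one plausible way of going from a token grid to dense pixels.

```python
import numpy as np

def depth_head(visual_tokens, grid_hw, patch, weight, bias):
    """Map image-token hidden states to a dense, strictly positive depth map.

    Hypothetical sketch, not the DepthVLM implementation.
    visual_tokens: (h * w, d) hidden states of the image tokens.
    grid_hw:       (h, w) size of the token grid.
    patch:         number of pixels each token covers per side.
    weight, bias:  (d, patch * patch) and (patch * patch,) projection parameters.
    """
    h, w = grid_hw
    n, _ = visual_tokens.shape
    assert n == h * w, "token count must match the grid size"

    # Each token predicts a patch x patch block of log-depth values.
    log_depth = visual_tokens @ weight + bias          # (h * w, patch * patch)

    # Rearrange the blocks into an image:
    # (h, w, p, p) -> (h, p, w, p) -> (h * p, w * p).
    blocks = log_depth.reshape(h, w, patch, patch)
    dense = blocks.transpose(0, 2, 1, 3).reshape(h * patch, w * patch)

    # Exponentiate so every predicted depth is positive.
    return np.exp(dense)


# Usage with random stand-in features (a 4 x 6 token grid, 14-pixel patches).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4 * 6, 32))
W = rng.standard_normal((32, 14 * 14)) * 0.01
b = np.zeros(14 * 14)
depth = depth_head(tokens, (4, 6), 14, W, b)
print(depth.shape)  # (56, 84)
```

The point of the sketch is the cost profile: one projection over the tokens that the backbone has already computed yields every pixel at once, in contrast to issuing a separate query per pixel or stopping at one coarse value per token.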

Code & Models

Repositories

Models

JonnyYu828/DepthVLM-4B
model · 570 downloads · 6 likes

Datasets

JonnyYu828/DepthVLM-Bench
dataset · 109 downloads
