Unlocking Dense Metric Depth Estimation in VLMs
Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke

TL;DR
DepthVLM turns an existing vision-language model into a dense 3D geometry predictor by attaching a lightweight depth head, so the model produces full-resolution depth maps alongside language outputs in a single forward pass.
Contribution
The paper introduces DepthVLM, a framework that adds dense depth prediction to VLMs while preserving their multimodal capability.
Findings
DepthVLM outperforms existing VLMs in depth estimation accuracy.
It achieves higher inference efficiency than prior methods.
It improves complex 3D spatial reasoning in multimodal tasks.
Abstract
Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We…
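The idea of a lightweight depth head on the LLM backbone can be illustrated with a minimal sketch. The abstract does not describe the head's internals, so everything below is an assumption made for illustration: the function name `depth_head`, the single linear projection, the softplus activation that keeps depth positive, the patch-unfolding layout, and all shapes. It uses NumPy with random weights in place of a real VLM.

```python
import numpy as np

def depth_head(hidden, grid_hw, patch, w, b):
    """Hypothetical depth head (not the paper's design).

    hidden:  (N, D) hidden states of the N visual tokens, N = gh * gw
    grid_hw: (gh, gw) layout of the visual tokens on the image grid
    patch:   side length in pixels of the square region each token covers
    w, b:    linear projection, shapes (D, patch * patch) and (patch * patch,)
    Returns a (gh * patch, gw * patch) map of positive depth values.
    """
    gh, gw = grid_hw
    n, _ = hidden.shape
    if n != gh * gw:
        raise ValueError("number of tokens must equal gh * gw")

    # Each token predicts one patch x patch block of depth values.
    logits = hidden @ w + b                      # (N, patch * patch)
    # Softplus, log(1 + exp(x)), keeps the predicted depth positive.
    depth = np.logaddexp(0.0, logits)

    # Unfold the per-token blocks into a full-resolution image.
    depth = depth.reshape(gh, gw, patch, patch)  # (gh, gw, p, p)
    depth = depth.transpose(0, 2, 1, 3)          # (gh, p, gw, p)
    return depth.reshape(gh * patch, gw * patch)

# Toy usage: a 2 x 3 token grid, 8-dim hidden states, 4 x 4 pixels per token.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((6, 8))
w = 0.1 * rng.standard_normal((8, 16))
b = np.zeros(16)
depth_map = depth_head(hidden, (2, 3), 4, w, b)
print(depth_map.shape)
```

In this sketch the depth map is computed from the same hidden states that would feed the language output, which is one way to read "a single forward pass"; a real head would be trained with dense depth supervision rather than use random weights.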
