WorDepth: Variational Language Prior for Monocular Depth Estimation
Ziyao Zeng, Daniel Wang, Fengyu Yang, Hyoungseob Park, Yangchao Wu, Stefano Soatto, Byung-Woo Hong, Dong Lao, Alex Wong

TL;DR
WorDepth introduces a variational language prior to enhance monocular depth estimation by integrating text descriptions, enabling more accurate 3D reconstructions from single images in indoor and outdoor scenes.
Contribution
This work is the first to incorporate a variational language prior into monocular depth estimation, leveraging text descriptions to improve depth prediction accuracy.
Findings
Adding a language prior improves depth estimation performance on the NYUv2 and KITTI datasets.
The variational framework effectively models the distribution of plausible 3D reconstructions.
The approach outperforms baseline methods without language integration.
Abstract
Three-dimensional (3D) reconstruction from a single image is an ill-posed problem with inherent ambiguities, i.e. scale. Predicting a 3D scene from text description(s) is similarly ill-posed, i.e. spatial arrangements of objects described. We investigate the question of whether two inherently ambiguous modalities can be used in conjunction to produce metric-scaled reconstructions. To test this, we focus on monocular depth estimation, the problem of predicting a dense depth map from a single image, but with an additional text caption describing the scene. To this end, we begin by encoding the text caption as a mean and standard deviation; using a variational framework, we learn the distribution of the plausible metric reconstructions of 3D scenes corresponding to the text captions as a prior. To "select" a specific reconstruction or depth map, we encode the given image through a…
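The text-encoding step described in the abstract (a caption mapped to a mean and standard deviation, then used as a variational prior) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear projections, the log-sigma parameterization, and the KL term against a standard normal are all assumptions, chosen because they are the usual ingredients of a variational encoder. The optional `eps` argument is only there so a draw can be made deterministic.

```python
import math
import random


def linear(weights, bias, x):
    """Apply a dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def encode_caption(text_feature, w_mu, b_mu, w_log_sigma, b_log_sigma):
    """Map a text feature vector to a Gaussian mean and standard deviation.

    The linear heads are hypothetical; any text encoder output could be fed in.
    """
    mu = linear(w_mu, b_mu, text_feature)
    # Predicting log-sigma and exponentiating keeps sigma strictly positive.
    sigma = [math.exp(s)
             for s in linear(w_log_sigma, b_log_sigma, text_feature)]
    return mu, sigma


def sample_latent(mu, sigma, eps=None, rng=random):
    """Reparameterized draw z = mu + sigma * eps, with eps ~ N(0, 1) by default."""
    if eps is None:
        eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


def kl_to_standard_normal(mu, sigma):
    """KL(N(mu, sigma^2) || N(0, I)), summed over dimensions (assumed regularizer)."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))


if __name__ == "__main__":
    # Toy 2-D "caption feature" with identity mean head and zero log-sigma head.
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    mu, sigma = encode_caption([1.0, 2.0], identity, [0.0, 0.0], zeros, [0.0, 0.0])
    z = sample_latent(mu, sigma)
    print("mu:", mu, "sigma:", sigma, "sample:", z)
    print("KL to N(0, I):", kl_to_standard_normal(mu, sigma))
```

Different noise draws give different latent codes for the same caption, which is one way to picture the caption defining a distribution over plausible scenes rather than a single depth map.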
Taxonomy
Topics: Advanced Vision and Imaging · Medical Image Segmentation Techniques · Image and Object Detection Techniques
