WorDepth: Variational Language Prior for Monocular Depth Estimation

Ziyao Zeng; Daniel Wang; Fengyu Yang; Hyoungseob Park; Yangchao Wu; Stefano Soatto; Byung-Woo Hong; Dong Lao; Alex Wong

arXiv:2404.03635 · cs.CV · June 4, 2024 · 2 citations



TL;DR

WorDepth introduces a variational language prior to enhance monocular depth estimation by integrating text descriptions, enabling more accurate 3D reconstructions from single images in indoor and outdoor scenes.

Contribution

This work is the first to incorporate a variational language prior into monocular depth estimation, leveraging text descriptions to improve depth prediction accuracy.

Findings

01 Conditioning on a text description improves depth estimation performance on the NYUv2 and KITTI datasets.

02 The variational framework models the distribution of plausible 3D reconstructions that correspond to a text caption.

03 The approach outperforms baseline methods that do not use language.

Abstract

Three-dimensional (3D) reconstruction from a single image is an ill-posed problem with inherent ambiguities, such as scale. Predicting a 3D scene from text description(s) is similarly ill-posed, for example in the spatial arrangement of the objects described. We investigate whether two inherently ambiguous modalities can be used in conjunction to produce metric-scaled reconstructions. To test this, we focus on monocular depth estimation, the problem of predicting a dense depth map from a single image, but with an additional text caption describing the scene. To this end, we begin by encoding the text caption as a mean and standard deviation; using a variational framework, we learn the distribution of the plausible metric reconstructions of 3D scenes corresponding to the text captions as a prior. To "select" a specific reconstruction or depth map, we encode the given image through a…
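The step of encoding a caption as a mean and standard deviation can be illustrated with a minimal sketch of how a variational model typically draws a latent code from such a Gaussian (the reparameterization trick). This is not taken from the paper's code: the function names, the 4-dimensional vectors, and the KL term against a standard normal are assumptions chosen for illustration, and the excerpt does not state which training objective the authors use.

```python
import math
import random


def reparameterize(mean, std, rng):
    """Draw z = mean + std * eps elementwise, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def kl_to_standard_normal(mean, std):
    """KL(N(mean, diag(std^2)) || N(0, I)), summed over dimensions."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s) for m, s in zip(mean, std))


# Hypothetical 4-dimensional output of a text encoder (illustrative values only).
text_mean = [0.2, -0.1, 0.0, 0.5]
text_std = [1.0, 0.5, 0.8, 1.2]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
z_a = reparameterize(text_mean, text_std, rng)  # one plausible latent code
z_b = reparameterize(text_mean, text_std, rng)  # another draw for the same caption
kl = kl_to_standard_normal(text_mean, text_std)  # common variational regularizer
```

Two draws from the same caption give different latent codes, which mirrors the idea that one caption is compatible with many plausible scenes; something else has to pick a specific one.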


Code & Models

Repositories

adonis-galaxy/wordepth (PyTorch, official)


Taxonomy

Topics: Advanced Vision and Imaging · Medical Image Segmentation Techniques · Image and Object Detection Techniques

Methods: Focus