VGLD: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery

Bojin Wu; Jing Chen

arXiv:2505.02704 · cs.CV · July 15, 2025

TL;DR

VGLD is a framework that uses high-level visual semantics to disambiguate natural-language descriptions, enabling more accurate and robust monocular depth scale recovery from ambiguous textual inputs.

Contribution

The paper proposes VGLD, a method that jointly encodes image and text to predict a set of global linear transformation parameters, improving metric depth estimation when the language input is ambiguous.

Findings

1. VGLD reduces scale estimation bias caused by language ambiguity.

2. VGLD achieves robust metric predictions across indoor and outdoor benchmarks.

3. VGLD functions effectively as a universal, lightweight alignment module in zero-shot settings.

Abstract

Monocular depth estimation can be broadly categorized into two directions: relative depth estimation, which predicts normalized or inverse depth without absolute scale, and metric depth estimation, which aims to recover depth with real-world scale. While relative methods are flexible and data-efficient, their lack of metric scale limits their utility in downstream tasks. A promising solution is to infer absolute scale from textual descriptions. However, such language-based recovery is highly sensitive to natural language ambiguity, as the same image may be described differently across perspectives and styles. To address this, we introduce VGLD (Visually-Guided Linguistic Disambiguation), a framework that incorporates high-level visual semantics to resolve ambiguity in textual inputs. By jointly encoding both image and text, VGLD predicts a set of global linear transformation parameters…
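To make the idea of "global linear transformation parameters" concrete, here is a minimal sketch of how a single scale and shift could turn a relative depth map into a metric one. It is an illustration under stated assumptions, not the authors' implementation: the function name `apply_global_linear_transform`, the parameter names `scale` and `shift`, and the choice to apply the transform directly to depth (rather than to inverse depth, which some relative-depth models output) are all hypothetical. In VGLD the two parameters would come from the jointly encoded image and text; here they are simply passed in.

```python
from typing import List

DepthMap = List[List[float]]


def apply_global_linear_transform(
    relative_depth: DepthMap, scale: float, shift: float
) -> DepthMap:
    """Map relative depth to metric depth with one global scale and shift.

    Assumption (not from the paper): metric = scale * relative + shift,
    applied identically to every pixel.
    """
    return [[scale * d + shift for d in row] for row in relative_depth]


# Toy 2x2 relative depth map with values normalized to [0, 1].
relative = [[0.0, 0.5], [1.0, 0.25]]

# Hypothetical parameters; a model would predict these from image and text.
metric = apply_global_linear_transform(relative, scale=4.0, shift=1.0)
print(metric)  # [[1.0, 3.0], [5.0, 2.0]]
```

Because the same two numbers are applied to every pixel, the transform preserves the ordering of depths in the relative map and only fixes its absolute scale and offset.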

Code & Models

Repositories

pakinwu/vgld (Official)

Taxonomy

Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Medical Image Segmentation Techniques

Methods: Sparse Evolutionary Training