VGLD: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery
Bojin Wu, Jing Chen

TL;DR
VGLD introduces a visual grounding framework that uses high-level visual semantics to disambiguate natural language descriptions, enabling more accurate and robust monocular depth scale recovery from ambiguous textual inputs.
Contribution
The paper proposes VGLD, a method that jointly encodes image and text to predict a set of global linear transformation parameters, improving metric depth estimation when the text description is ambiguous.
Findings
VGLD reduces scale estimation bias caused by language ambiguity.
VGLD achieves robust metric predictions across indoor and outdoor benchmarks.
VGLD functions effectively as a universal, lightweight alignment module in zero-shot settings.
Abstract
Monocular depth estimation can be broadly categorized into two directions: relative depth estimation, which predicts normalized or inverse depth without absolute scale, and metric depth estimation, which aims to recover depth with real-world scale. While relative methods are flexible and data-efficient, their lack of metric scale limits their utility in downstream tasks. A promising solution is to infer absolute scale from textual descriptions. However, such language-based recovery is highly sensitive to natural language ambiguity, as the same image may be described differently across perspectives and styles. To address this, we introduce VGLD (Visually-Guided Linguistic Disambiguation), a framework that incorporates high-level visual semantics to resolve ambiguity in textual inputs. By jointly encoding both image and text, VGLD predicts a set of global linear transformation parameters…
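The abstract is cut off before it says how the predicted parameters are used. As a rough illustration only, a global linear transformation of a depth map is often a single scale and a single shift applied to every pixel. The sketch below assumes that form. The function name, the example values, and the choice to apply the transform directly to depth rather than to inverse depth are all assumptions for illustration and are not taken from the paper.

```python
from typing import List

DepthMap = List[List[float]]


def apply_global_linear_transform(
    relative_depth: DepthMap, scale: float, shift: float
) -> DepthMap:
    """Map every pixel d of a relative depth map to scale * d + shift.

    The same two scalars are used for the whole image, which is what
    makes the transformation "global". (Assumed form, not the paper's code.)
    """
    return [[scale * d + shift for d in row] for row in relative_depth]


# Hypothetical 2x2 relative depth map with values normalized to [0, 1].
relative_depth = [[0.0, 0.5], [0.25, 1.0]]

# Hypothetical parameters; in a VGLD-style pipeline these would come from
# a model that sees both the image and its text description.
scale, shift = 4.0, 1.0

metric_depth = apply_global_linear_transform(relative_depth, scale, shift)
print(metric_depth)  # [[1.0, 3.0], [2.0, 5.0]]
```

Because only two scalars are involved, an error in either one shifts or stretches every pixel of the output, which is why ambiguity in the text that drives the prediction matters so much for the final metric depth.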
Taxonomy
Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Medical Image Segmentation Techniques
Methods: Sparse Evolutionary Training
