Large Language Models Can Understanding Depth from Monocular Images

Zhongyi Xia; Tianzhao Wu

arXiv:2409.01133·cs.CV·September 4, 2024

Open Access

TL;DR

This paper demonstrates that large language models can interpret depth from monocular images effectively with minimal supervision by using a novel multimodal framework called LLM-MDE, which employs cross-modal reprogramming and adaptive prompts.

Contribution

Introduces LLM-MDE, a multimodal framework that enables large language models to perform monocular depth estimation through innovative cross-modal reprogramming and prompt estimation techniques.

Findings

01. LLM-MDE outperforms existing methods in few-/zero-shot depth estimation tasks.

02. The framework minimizes resource utilization while maintaining high accuracy.

03. Experiments confirm the effectiveness of the proposed approach on real-world datasets.

Abstract

Monocular depth estimation is a critical function in computer vision applications. This paper shows that large language models (LLMs) can effectively interpret depth with minimal supervision, while using resources efficiently and keeping a consistent neural network architecture. We introduce LLM-MDE, a multimodal framework that deciphers depth through language comprehension. Specifically, LLM-MDE employs two main strategies to enhance the pretrained LLM's capability for depth estimation: cross-modal reprogramming, which aligns vision representations with text prototypes, and an adaptive prompt estimation module, which automatically generates prompts based on monocular images. Comprehensive experiments on real-world MDE datasets confirm the effectiveness and superiority of LLM-MDE, which excels in few-/zero-shot tasks while minimizing resource use. The source code is…
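
The abstract describes cross-modal reprogramming only as aligning vision representations with text prototypes and does not spell out the mechanism. One common way to realize such an alignment is cross-attention, where each image patch embedding acts as a query over a small set of text prototype vectors. The sketch below illustrates that idea under stated assumptions: the function name `reprogram`, the absence of learned projection matrices, and the random toy inputs are all hypothetical and are not taken from the paper or its source code.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reprogram(patches, prototypes):
    """Hypothetical cross-attention reprogramming step (not the paper's code).

    patches:    list of N vision patch embeddings, each of length d
    prototypes: list of K text prototype embeddings, each of length d
    Returns (reprogrammed, weights): N vectors of length d, each a
    softmax-weighted mix of the prototypes, plus the N x K attention weights.
    Assumption: queries, keys, and values are used without learned projections.
    """
    dim = len(prototypes[0])
    scale = 1.0 / math.sqrt(dim)
    reprogrammed, weights = [], []
    for query in patches:
        # Scaled dot-product score between this patch and every prototype.
        scores = [scale * sum(q * k for q, k in zip(query, proto))
                  for proto in prototypes]
        attn = softmax(scores)
        # Express the patch as a convex combination of text prototypes.
        mixed = [sum(a * proto[c] for a, proto in zip(attn, prototypes))
                 for c in range(dim)]
        reprogrammed.append(mixed)
        weights.append(attn)
    return reprogrammed, weights


if __name__ == "__main__":
    random.seed(0)
    # Toy inputs: 3 patch embeddings and 5 text prototypes, all of dimension 4.
    toy_patches = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
    toy_prototypes = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
    tokens, attn = reprogram(toy_patches, toy_prototypes)
    print(len(tokens), len(tokens[0]))  # number of patches, embedding dimension
```

In this sketch every output vector lies in the span of the text prototypes, which is the sense in which the vision features are moved toward the language model's embedding space. A trained system would normally add learned query, key, and value projections and multiple attention heads.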

Taxonomy

Topics: Handwritten Text Recognition Techniques · 3D Surveying and Cultural Heritage

Methods: ALIGN