Leveraging Stable Diffusion for Monocular Depth Estimation via Image Semantic Encoding

Jingming Xia; Guanqun Cao; Guang Ma; Yiben Luo; Qinzhao Li; John Oyekan

arXiv:2502.01666 · cs.CV · February 5, 2025

Open Access

TL;DR

This paper introduces a novel semantic encoding method leveraging Stable Diffusion for monocular depth estimation, improving robustness and accuracy in complex outdoor environments by extracting contextual visual features.

Contribution

The paper presents a new image-based semantic embedding that enhances depth prediction by directly utilizing visual features, addressing limitations of previous text-based models like CLIP.

Findings

01

Achieves comparable performance to state-of-the-art models on KITTI and Waymo datasets.

02

Demonstrates improved robustness and adaptability in outdoor depth estimation.

03

Addresses limitations of CLIP embeddings in complex outdoor scenes.

Abstract

Monocular depth estimation predicts depth from a single RGB image and plays a crucial role in applications such as autonomous driving, robotic navigation, and 3D reconstruction. Recent advances in learning-based methods have significantly improved depth estimation performance. Generative models, particularly Stable Diffusion, have shown remarkable potential in recovering fine details and reconstructing missing regions through large-scale training on diverse datasets. However, models like CLIP, which rely on textual embeddings, are less effective in complex outdoor environments where rich contextual information is needed. Here, we propose a novel image-based semantic embedding that extracts contextual information directly from visual features, significantly improving depth prediction in complex…

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image Processing Techniques · Computer Graphics and Visualization Techniques

Methods: Diffusion · Contrastive Language-Image Pre-training