Leveraging Stable Diffusion for Monocular Depth Estimation via Image Semantic Encoding
Jingming Xia, Guanqun Cao, Guang Ma, Yiben Luo, Qinzhao Li, John Oyekan

TL;DR
This paper introduces a novel semantic encoding method leveraging Stable Diffusion for monocular depth estimation, improving robustness and accuracy in complex outdoor environments by extracting contextual visual features.
Contribution
The paper presents a new image-based semantic embedding that enhances depth prediction by directly utilizing visual features, addressing limitations of previous text-based models like CLIP.
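To make the idea concrete, here is a minimal sketch of how image-derived tokens could stand in for text tokens as the conditioning input to a cross-attention layer, which is how Stable Diffusion normally consumes CLIP text embeddings. This is an illustration under stated assumptions, not the authors' architecture: the function names, dimensions, and random weights are all hypothetical, and the summary above does not say how the paper's encoder produces its embedding.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def image_semantic_tokens(patch_feats, w_proj):
    # Hypothetical encoder head: project image patch features (N, d_img)
    # into conditioning tokens (N, d_cond). Assumed, not from the paper.
    return patch_feats @ w_proj

def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    # Standard scaled dot-product cross-attention: queries come from the
    # denoiser latents, keys and values come from the conditioning tokens.
    q = latent_tokens @ w_q            # (M, d)
    k = cond_tokens @ w_k              # (N, d)
    v = cond_tokens @ w_v              # (N, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (M, N)
    return attn @ v, attn

# Illustrative sizes only (assumptions).
rng = np.random.default_rng(0)
d_img, d_cond, d_lat, d = 32, 16, 24, 8
patch_feats = rng.normal(size=(49, d_img))   # e.g. a 7x7 grid of patch features
latents = rng.normal(size=(64, d_lat))       # e.g. an 8x8 latent map, flattened

w_proj = rng.normal(size=(d_img, d_cond))
w_q = rng.normal(size=(d_lat, d))
w_k = rng.normal(size=(d_cond, d))
w_v = rng.normal(size=(d_cond, d))

cond = image_semantic_tokens(patch_feats, w_proj)
out, attn = cross_attention(latents, cond, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

In this sketch each latent position attends over tokens computed from the image itself, so the conditioning signal carries scene-specific visual context rather than a text prompt.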
Findings
Achieves comparable performance to state-of-the-art models on KITTI and Waymo datasets.
Demonstrates improved robustness and adaptability in outdoor depth estimation.
Addresses limitations of CLIP embeddings in complex outdoor scenes.
Abstract
Monocular depth estimation involves predicting depth from a single RGB image and plays a crucial role in applications such as autonomous driving, robotic navigation, 3D reconstruction, etc. Recent advancements in learning-based methods have significantly improved depth estimation performance. Generative models, particularly Stable Diffusion, have shown remarkable potential in recovering fine details and reconstructing missing regions through large-scale training on diverse datasets. However, models like CLIP, which rely on textual embeddings, face limitations in complex outdoor environments where rich context information is needed. These limitations reduce their effectiveness in such challenging scenarios. Here, we propose a novel image-based semantic embedding that extracts contextual information directly from visual features, significantly improving depth prediction in complex…
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image Processing Techniques · Computer Graphics and Visualization Techniques
Methods: Diffusion · Contrastive Language-Image Pre-training
