PureCLIP-Depth: Prompt-Free and Decoder-Free Monocular Depth Estimation within CLIP Embedding Space
Ryutaro Miya, Kazuyoshi Fushinobu, Tatsuya Kawaguchi

TL;DR
PureCLIP-Depth is a prompt-free, decoder-free monocular depth estimation method that operates entirely within the CLIP embedding space. It achieves state-of-the-art results among CLIP embedding-based models without relying on geometric features.
Contribution
The paper presents a new approach to monocular depth estimation that directly maps RGB images to depth within the CLIP embedding space, eliminating the need for prompts or decoders.
Findings
Achieves state-of-the-art performance among CLIP-based models
Operates effectively on both indoor and outdoor datasets
Does not rely on geometric features or prompts
Abstract
We propose PureCLIP-Depth, a completely prompt-free, decoder-free Monocular Depth Estimation (MDE) model that operates entirely within the Contrastive Language-Image Pre-training (CLIP) embedding space. Unlike recent models that rely heavily on geometric features, we explore a novel approach to MDE driven by conceptual information, performing computations directly within the conceptual CLIP space. The core of our method lies in learning a direct mapping from the RGB domain to the depth domain strictly inside this embedding space. Our approach achieves state-of-the-art performance among CLIP embedding-based models on both indoor and outdoor datasets. The code used in this research is available at: https://github.com/ryutaroLF/PureCLIP-Depth
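To make the idea of "a mapping learned inside an embedding space" concrete, here is a toy sketch. It is not the authors' architecture, which the abstract does not describe. Everything in it is an assumption for illustration: random unit vectors stand in for CLIP embeddings of RGB patches, a synthetic matrix generates the corresponding "depth-domain" embeddings, and a plain least-squares linear map replaces whatever learned mapping the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 500 patch embeddings of dimension 64.
# Real CLIP embeddings are larger; these are random stand-ins.
n, dim = 500, 64

# Stand-in "RGB-domain" embeddings, L2-normalised as CLIP embeddings usually are.
rgb_emb = rng.normal(size=(n, dim))
rgb_emb /= np.linalg.norm(rgb_emb, axis=1, keepdims=True)

# Synthetic "depth-domain" embeddings produced by a hidden linear map.
# This is an assumption made so the toy problem has a known answer.
true_map = rng.normal(size=(dim, dim))
depth_emb = rgb_emb @ true_map

# Fit a map from RGB embeddings to depth embeddings by least squares.
# Input and output both stay in the same embedding space; no decoder is involved.
learned_map, *_ = np.linalg.lstsq(rgb_emb, depth_emb, rcond=None)

# Check how closely the fitted map reproduces the target embeddings.
pred = rgb_emb @ learned_map
max_err = float(np.abs(pred - depth_emb).max())
print(f"max abs error: {max_err:.2e}")
```

Because the toy data is noise-free and there are more samples than dimensions, the fitted map reproduces the targets to numerical precision. A real model would face noisy, nonlinear structure and would need a separate step to turn depth-domain embeddings into metric depth values.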
Taxonomy
Topics: Advanced Vision and Imaging · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
