PureCLIP-Depth: Prompt-Free and Decoder-Free Monocular Depth Estimation within CLIP Embedding Space

Ryutaro Miya; Kazuyoshi Fushinobu; Tatsuya Kawaguchi

arXiv:2603.16238 · cs.CV · March 18, 2026

TL;DR

PureCLIP-Depth is a prompt-free, decoder-free monocular depth estimation method that operates entirely within the CLIP embedding space. It is driven by conceptual information rather than geometric features and reports state-of-the-art results among CLIP embedding-based models on both indoor and outdoor datasets.

Contribution

The paper presents a new approach to monocular depth estimation that directly maps RGB images to depth within the CLIP embedding space, eliminating the need for prompts or decoders.

Findings

01. Achieves state-of-the-art performance among CLIP-based models

02. Operates effectively on both indoor and outdoor datasets

03. Does not rely on geometric features or prompts

Abstract

We propose PureCLIP-Depth, a completely prompt-free, decoder-free Monocular Depth Estimation (MDE) model that operates entirely within the Contrastive Language-Image Pre-training (CLIP) embedding space. Unlike recent models that rely heavily on geometric features, we explore a novel approach to MDE driven by conceptual information, performing computations directly within the conceptual CLIP space. The core of our method lies in learning a direct mapping from the RGB domain to the depth domain strictly inside this embedding space. Our approach achieves state-of-the-art performance among CLIP embedding-based models on both indoor and outdoor datasets. The code used in this research is available at: https://github.com/ryutaroLF/PureCLIP-Depth
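
To make the idea of "learning a direct mapping from the RGB domain to the depth domain" inside an embedding space more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the form of the mapping, so the linear map, the 4-dimensional vectors, the synthetic "embeddings", and every function name below are assumptions chosen purely for illustration. The sketch fits a map between paired vectors by gradient descent on squared error, using only the Python standard library.

```python
import random


def apply_map(weights, vector):
    """Apply a linear map (given as a list of rows) to one embedding vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def fit_linear_map(sources, targets, lr=0.1, epochs=1000):
    """Fit W so that W @ source is close to target, by batch gradient descent.

    The update uses the gradient of half the mean squared error.
    """
    dim_in, dim_out, n = len(sources[0]), len(targets[0]), len(sources)
    weights = [[0.0] * dim_in for _ in range(dim_out)]
    for _ in range(epochs):
        grad = [[0.0] * dim_in for _ in range(dim_out)]
        for x, y in zip(sources, targets):
            pred = apply_map(weights, x)
            for i in range(dim_out):
                err = pred[i] - y[i]
                for j in range(dim_in):
                    grad[i][j] += err * x[j] / n
        for i in range(dim_out):
            for j in range(dim_in):
                weights[i][j] -= lr * grad[i][j]
    return weights


def mean_squared_error(weights, sources, targets):
    """Average squared distance between mapped sources and their targets."""
    total = 0.0
    for x, y in zip(sources, targets):
        pred = apply_map(weights, x)
        total += sum((p - t) ** 2 for p, t in zip(pred, y))
    return total / len(sources)


# Synthetic stand-ins (assumption): random vectors play the role of RGB-side
# embeddings, and a hidden linear map generates the paired depth-side vectors.
rng = random.Random(0)
DIM = 4
true_map = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
rgb_embeddings = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(64)]
depth_embeddings = [apply_map(true_map, e) for e in rgb_embeddings]

learned_map = fit_linear_map(rgb_embeddings, depth_embeddings)
mse = mean_squared_error(learned_map, rgb_embeddings, depth_embeddings)
print(f"training MSE: {mse:.3e}")
```

In a real setting the source vectors would come from a CLIP image encoder and the mapping would likely be nonlinear; the linked repository is the reference for how the authors actually structure and train it.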

Taxonomy

Topics: Advanced Vision and Imaging · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications