Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation
Xueting Hu, Ce Zhang, Yi Zhang, Bowen Hai, Ke Yu, Zhihai He

TL;DR
This paper introduces a few-shot learning approach that adapts CLIP for monocular depth estimation, allowing scene-specific depth bin assignment and improved generalization with minimal training data.
Contribution
It proposes a novel method to adapt CLIP for depth estimation using few-shot learning and learnable prompts, enhancing accuracy and scene adaptability.
Findings
Outperforms previous methods by up to 10.6% in MARE on the NYU V2 and KITTI datasets.
Uses only one image per scene for training, reducing data requirements.
Demonstrates improved generalization across diverse scenes.
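The MARE figure above can be made concrete with a small sketch. It assumes MARE denotes the mean absolute relative error commonly reported in depth estimation, i.e. the average of |predicted - true| / true over valid pixels. The function name `mare` and the toy values are illustrative and are not taken from the paper.

```python
def mare(pred, gt):
    """Mean absolute relative error between predicted and ground-truth depths.

    Assumes both inputs are equal-length sequences of positive depths
    (e.g. in metres) with invalid pixels already filtered out.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    if not gt:
        raise ValueError("at least one depth value is required")
    # Per-pixel relative error, averaged over all pixels.
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


# Toy example: one exact prediction and one that is 1 m short at 5 m.
toy_error = mare([2.0, 4.0], [2.0, 5.0])
```

Lower values are better, so an improvement in MARE means a reduction of this quantity.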
Abstract
Pre-trained Vision-Language Models (VLMs), such as CLIP, have shown enhanced performance across a range of tasks that involve the integration of visual and linguistic modalities. When CLIP is used for depth estimation tasks, the patches, divided from the input images, can be combined with a series of semantic descriptions of the depth information to obtain similarity results. The coarse estimation of depth is then achieved by weighting and summing the depth values, called depth bins, corresponding to the predefined semantic descriptions. The zero-shot approach circumvents the computational and time-intensive nature of traditional fully-supervised depth estimation methods. However, this method, utilizing fixed depth bins, may not effectively generalize as images from different scenes may exhibit distinct depth distributions. To address this challenge, we propose a few-shot-based method…
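The zero-shot pipeline described in the abstract (patch-to-text similarity followed by a weighted sum over depth bins) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-dimensional toy embeddings stand in for real CLIP features, and the softmax normalization, the `temperature` value, the bin values, and all function names are hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_depth(patch_feature, text_features, depth_bins, temperature=0.1):
    """Estimate one patch's depth as a similarity-weighted sum of depth bins.

    text_features[i] is the embedding of the i-th depth description
    (e.g. "close", "far") and depth_bins[i] is the depth value tied to it.
    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    sims = [cosine(patch_feature, t) for t in text_features]
    weights = softmax([s / temperature for s in sims])
    return sum(w * d for w, d in zip(weights, depth_bins))


# Toy stand-ins for text embeddings of "close" and "far" descriptions.
toy_text_features = [[1.0, 0.0], [0.0, 1.0]]
toy_depth_bins = [1.0, 10.0]  # hypothetical fixed bins, in metres

# A patch equally similar to both descriptions lands midway between the bins.
ambiguous_depth = coarse_depth([1.0, 1.0], toy_text_features, toy_depth_bins)

# A patch aligned with "close" is pulled almost entirely to the first bin.
near_depth = coarse_depth([1.0, 0.0], toy_text_features, toy_depth_bins)
```

Because `depth_bins` is fixed here, every scene is mapped to the same depth range, which is the limitation the abstract raises; passing a different bin list per scene type is one way to picture the scene-specific assignment the paper proposes.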
Videos
Learning To Adapt CLIP for Few-Shot Monocular Depth Estimation · YouTube
Taxonomy
Topics: Advanced Vision and Imaging · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
