Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation

Reyhaneh Ahani Manghotay (Simon Fraser University, Burnaby, Canada); Jie Liang (Eastern Institute of Technology, Ningbo, China)

arXiv:2604.01118 · cs.CV · April 2, 2026

TL;DR

This paper introduces MoA-DepthCLIP, a lightweight, parameter-efficient framework that adapts CLIP for monocular depth estimation, pairing minimal fine-tuning with a composite loss that enforces geometric constraints.

Contribution

The paper proposes a lightweight Mixture-of-Adapters (MoA) module for adapting CLIP to monocular depth estimation, improving accuracy with minimal supervision and few trainable parameters.
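To make the Mixture-of-Adapters idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the bottleneck adapter design, the softmax gate driven by a global context vector, the residual connection, and every name in it (`Adapter`, `MixtureOfAdapters`, `gate_weights`) are assumptions chosen for illustration only.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_matrix(rows, cols, rng, scale=0.02):
    """Small random weights; a real model would learn these."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class Adapter:
    """Hypothetical bottleneck adapter: down-project, ReLU, up-project."""

    def __init__(self, dim, bottleneck, rng):
        self.down = init_matrix(bottleneck, dim, rng)
        self.up = init_matrix(dim, bottleneck, rng)

    def __call__(self, token):
        hidden = [max(0.0, v) for v in matvec(self.down, token)]
        return matvec(self.up, hidden)


class MixtureOfAdapters:
    """Assumed design: a gate on the global context vector mixes the adapters."""

    def __init__(self, dim, bottleneck, num_adapters, seed=0):
        rng = random.Random(seed)
        self.adapters = [Adapter(dim, bottleneck, rng) for _ in range(num_adapters)]
        self.gate = init_matrix(num_adapters, dim, rng)

    def gate_weights(self, context):
        # One mixing weight per adapter, summing to 1.
        return softmax(matvec(self.gate, context))

    def __call__(self, token, context):
        weights = self.gate_weights(context)
        outputs = [adapter(token) for adapter in self.adapters]
        mixed = [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(token))
        ]
        # Residual connection keeps the frozen backbone feature intact.
        return [t + m for t, m in zip(token, mixed)]


moa = MixtureOfAdapters(dim=8, bottleneck=2, num_adapters=3)
patch_token = [0.1] * 8      # stand-in for one ViT patch feature
global_context = [1.0] * 8   # stand-in for a global semantic context vector
adapted = moa(patch_token, global_context)
```

In a sketch like this, only the adapter and gate weights would be trained, which is one way a method can stay parameter-efficient while the backbone remains mostly frozen.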

Findings

01

Achieves $\delta_1$ accuracy of 0.745 on NYU Depth V2

02

Reduces RMSE from 1.176 (baseline) to 0.520

03

Requires significantly fewer trainable parameters
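For readers unfamiliar with the two metrics in the findings, the sketch below computes them under their standard definitions: the delta-1 accuracy is the fraction of valid pixels where max(pred/gt, gt/pred) < 1.25, and RMSE is the root mean squared error in depth units. The function names and the convention of masking out pixels with non-positive ground truth are assumptions, and the toy values are illustrative, not data from the paper.

```python
import math


def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels whose depth ratio is within the threshold."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    hits = sum(
        1 for p, g in valid
        if p > 0 and max(p / g, g / p) < threshold
    )
    return hits / len(valid)


def rmse(pred, gt):
    """Root mean squared error over pixels with valid ground truth."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return math.sqrt(sum((p - g) ** 2 for p, g in valid) / len(valid))


# Toy flattened depth maps in metres; the last pixel has no ground truth.
pred_depth = [1.0, 2.0, 4.0, 3.0]
true_depth = [1.0, 2.0, 2.0, 0.0]

accuracy = delta1(pred_depth, true_depth)
error = rmse(pred_depth, true_depth)
```

Higher delta-1 and lower RMSE are both better, so the two findings above point in the same direction.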

Abstract

Leveraging the rich semantic features of vision-language models (VLMs) like CLIP for monocular depth estimation tasks is a promising direction, yet often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that synergizes depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark,…
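The abstract mentions a hybrid prediction architecture that combines depth bin classification with direct regression but does not say here how the two are merged. The sketch below shows one plausible reading: take the expected depth under a softmax over uniform bins, then blend it with a directly regressed value. The uniform bin layout, the 0 to 10 m range, the blending weight `alpha`, and all function names are assumptions, not details from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(d_min, d_max, num_bins):
    """Centers of uniformly spaced depth bins (assumed layout)."""
    width = (d_max - d_min) / num_bins
    return [d_min + (i + 0.5) * width for i in range(num_bins)]


def hybrid_depth(bin_logits, regressed_depth, centers, alpha=0.5):
    """Blend the classification-based depth with a directly regressed depth.

    alpha is a hypothetical mixing weight; a model could also learn it.
    """
    probs = softmax(bin_logits)
    # Expected depth under the predicted bin distribution.
    classified_depth = sum(p * c for p, c in zip(probs, centers))
    return alpha * classified_depth + (1.0 - alpha) * regressed_depth


centers = bin_centers(0.0, 10.0, 5)       # bin centers at 1, 3, 5, 7, 9 metres
uniform_logits = [0.0, 0.0, 0.0, 0.0, 0.0]  # no bin preferred
depth = hybrid_depth(uniform_logits, regressed_depth=7.0, centers=centers)
```

With uniform logits the classification branch returns the middle of the range (5 m), so the blended output lands halfway between that and the regressed 7 m. A confident bin prediction would pull the result toward that bin's center instead.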
