Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation
Reyhaneh Ahani Manghotay (Simon Fraser University, Burnaby, Canada), Jie Liang (Eastern Institute of Technology, Ningbo, China)

TL;DR
This paper introduces MoA-DepthCLIP, a lightweight, parameter-efficient framework that adapts CLIP for monocular depth estimation. It combines minimal fine-tuning with a composite loss that enforces geometric constraints, reaching a $\delta_1$ accuracy of 0.745 on NYU Depth V2.
Contribution
The paper proposes a lightweight Mixture-of-Adapters method for adapting CLIP to monocular depth estimation with minimal supervision and few trainable parameters.
Findings
Achieves $\delta_1$ accuracy of 0.745 on NYU Depth V2
Reduces RMSE from 1.176 (baseline) to 0.520
Requires significantly fewer trainable parameters
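For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how they are conventionally computed in the depth-estimation literature: $\delta_1$ is the fraction of pixels whose ratio $\max(\hat{d}/d, d/\hat{d})$ falls below 1.25, and RMSE is the root of the mean squared depth error. The function names and the toy values are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol (cropping, depth caps, masking) is not reproduced here.

```python
import math

def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below the threshold.

    Assumes all depths are strictly positive (invalid pixels already masked out).
    """
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)

def rmse(pred, gt):
    """Root mean squared error between predicted and ground-truth depths."""
    squared = sum((p - g) ** 2 for p, g in zip(pred, gt))
    return math.sqrt(squared / len(gt))

# Toy, made-up depths in metres (flattened image), for illustration only.
pred = [1.0, 2.0, 3.0, 4.0]
gt = [1.1, 2.0, 4.0, 4.0]

d1 = delta1(pred, gt)   # three of four pixels are within the 1.25 ratio
err = rmse(pred, gt)
print(d1, err)
```

Higher $\delta_1$ is better and lower RMSE is better, so the two findings above point in the same direction.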
Abstract
Leveraging the rich semantic features of vision-language models (VLMs) like CLIP for monocular depth estimation tasks is a promising direction, yet often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that synergizes depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark,…
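The hybrid prediction idea mentioned in the abstract, depth bin classification combined with direct regression, can be illustrated with a small sketch. The abstract does not say how the two outputs are fused, so everything below is an assumption made for illustration: uniform bins, a softmax-weighted expected depth over bin centers, and a fixed blending weight `alpha`. The names `bin_centers`, `hybrid_depth`, and `alpha` are hypothetical and not from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bin_centers(d_min, d_max, n_bins):
    """Centers of n_bins uniform depth bins over [d_min, d_max] (assumed layout)."""
    width = (d_max - d_min) / n_bins
    return [d_min + (i + 0.5) * width for i in range(n_bins)]

def hybrid_depth(bin_logits, regressed, centers, alpha=0.5):
    """Blend a classification-based depth with a directly regressed depth.

    The classification branch yields the expected depth under the softmax
    distribution over bins; alpha is a hypothetical fixed mixing weight.
    """
    probs = softmax(bin_logits)
    expected = sum(p * c for p, c in zip(probs, centers))
    return alpha * expected + (1.0 - alpha) * regressed

# One pixel: five bins over 0-10 m, uniform logits, regression branch says 4 m.
centers = bin_centers(0.0, 10.0, 5)
depth = hybrid_depth([0.0] * 5, regressed=4.0, centers=centers)
print(centers, depth)
```

A design like this lets the classification branch supply a coarse, stable estimate while the regression branch refines it; whether MoA-DepthCLIP blends them this way or through a learned fusion is not stated in the text available here.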
