ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation
Suraj Patni, Aradhye Agarwal, Chetan Arora

TL;DR
ECoDepth is a monocular depth estimation model built on a diffusion backbone and conditioned on embeddings from a pre-trained ViT. By leveraging these pre-trained image priors, it achieves state-of-the-art accuracy on standard datasets and strong zero-shot transfer performance.
Contribution
The paper proposes a new single image depth estimation (SIDE) model that uses a diffusion backbone conditioned on ViT embeddings, and reports that it surpasses previous methods in both accuracy and zero-shot transferability.
Findings
Achieves state-of-the-art results on NYUv2 with an absolute relative (Abs Rel) error of 0.059.
Improves the squared relative (Sq Rel) error on KITTI to 0.139.
Demonstrates strong zero-shot transfer across multiple datasets.
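The Abs Rel and Sq Rel figures above are error metrics, so lower is better. As a minimal sketch, assuming the paper follows the definitions commonly used in the depth estimation literature (mean of |pred - gt| / gt and mean of (pred - gt)^2 / gt over pixels with valid ground truth), they can be computed as follows. The function names, the validity rule (gt > 0), and the sample values are illustrative assumptions, not taken from the paper.

```python
def _valid_pairs(pred, gt):
    """Keep only pixels with positive ground-truth depth (assumed validity rule)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return pairs


def abs_rel(pred, gt):
    """Absolute relative error: mean of |pred - gt| / gt over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def sq_rel(pred, gt):
    """Squared relative error: mean of (pred - gt)^2 / gt over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum((p - g) ** 2 / g for p, g in pairs) / len(pairs)


# Toy depths in metres, flattened to a list; the last pixel has no ground truth.
pred = [2.0, 3.0, 1.5]
gt = [2.0, 5.0, 0.0]

print("Abs Rel:", abs_rel(pred, gt))
print("Sq Rel:", sq_rel(pred, gt))
```

Because Sq Rel squares the residual before normalising, it penalises large errors more heavily than Abs Rel does.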
Abstract
In the absence of parallax cues, a learning-based single image depth estimation (SIDE) model relies heavily on shading and contextual cues in the image. While this simplicity is attractive, it is necessary to train such models on large and varied datasets, which are difficult to capture. It has been shown that using embeddings from pre-trained foundational models, such as CLIP, improves zero-shot transfer in several applications. Taking inspiration from this, in our paper we explore the use of global image priors generated from a pre-trained ViT model to provide more detailed contextual information. We argue that the embedding vector from a ViT model, pre-trained on a large dataset, captures greater relevant information for SIDE than the usual route of generating pseudo image captions, followed by CLIP-based text embeddings. Based on this idea, we propose a new SIDE model using a…
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · Optical measurement and interference techniques · 3D Shape Modeling and Analysis
Methods: Softmax · Linear Layer · Layer Normalization · Residual Connection · Attention Is All You Need · Dense Connections · Multi-Head Attention · Vision Transformer · Diffusion
