Monocular absolute depth estimation from endoscopy via domain-invariant feature learning and latent consistency

Hao Li; Daiwei Lu; Jesse d'Almeida; Dilara Isik; Ehsan Khodapanah Aghdam; Nick DiSanto; Ayberk Acar; Susheela Sharma; Jie Ying Wu; Robert J. Webster III; Ipek Oguz

arXiv:2511.02247 · cs.CV · November 5, 2025

TL;DR

This paper introduces a domain-invariant feature learning approach for monocular absolute depth estimation in endoscopy, effectively reducing domain gaps and improving depth accuracy in surgical scenes.

Contribution

It proposes a latent feature alignment method that is agnostic to image translation, enhancing depth estimation across real and synthetic endoscopic images.

Findings

01 Outperforms state-of-the-art methods in absolute and relative depth metrics
02 Improves depth estimation across various backbone networks
03 Demonstrates effectiveness on endoscopic videos of central airway phantoms
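
The first finding distinguishes absolute from relative depth metrics. This listing does not say which metrics the paper reports, so the sketch below assumes three widely used ones (AbsRel, RMSE, and the delta < 1.25 accuracy) and assumes that "relative" scoring means rescaling the prediction by the ratio of medians before comparison. The function name `depth_metrics` and the `align_scale` flag are hypothetical, chosen only for illustration.

```python
import math
from statistics import median


def depth_metrics(pred, gt, align_scale=False):
    """Compute common depth-error metrics over paired depth values.

    pred, gt: flat sequences of positive depths in the same unit (e.g. mm).
    align_scale: if True, rescale pred by median(gt) / median(pred) first.
    This is an assumed convention for scoring relative (scale-free) depth.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    if align_scale:
        scale = median(gt) / median(pred)
        pred = [p * scale for p in pred]
    n = len(pred)
    pairs = list(zip(pred, gt))
    # Mean absolute error, normalized by the true depth.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Root mean squared error, in the unit of the input depths.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Fraction of points whose ratio to ground truth is within 1.25.
    delta1 = sum(max(p / g, g / p) < 1.25 for p, g in pairs) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


# Toy data: the prediction has the right structure but half the true scale.
gt = [10.0, 20.0, 40.0]
pred = [5.0, 10.0, 20.0]
absolute = depth_metrics(pred, gt)
relative = depth_metrics(pred, gt, align_scale=True)
```

On this toy input the prediction is penalized heavily under absolute scoring and not at all once the scale is aligned, which is why a method aimed at metric depth needs to be judged on the unaligned numbers as well.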

Abstract

Monocular depth estimation (MDE) is a critical task to guide autonomous medical robots. However, obtaining absolute (metric) depth from an endoscopy camera in surgical scenes is difficult, which limits supervised learning of depth on real endoscopic images. Current image-level unsupervised domain adaptation methods translate synthetic images with known depth maps into the style of real endoscopic frames and train depth networks using these translated images with their corresponding depth maps. However, a domain gap often remains between real and translated synthetic images. In this paper, we present a latent feature alignment method to improve absolute depth estimation by reducing this domain gap in the context of endoscopic videos of the central airway. Our methods are agnostic to the image translation process and focus on the depth estimation itself. Specifically, the depth network…
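
The abstract is cut off before it describes the mechanism, so the following is only a generic illustration of what a latent consistency term can look like, not the authors' loss. It assumes an encoder has already produced one feature vector for a synthetic image and one for its translated counterpart, penalizes their cosine distance, and adds that penalty to an L1 depth loss. All names (`cosine_consistency`, `l1_depth_loss`, `total_loss`) and the weight value are hypothetical.

```python
import math


def cosine_consistency(feat_a, feat_b):
    """Return 1 - cosine similarity between two latent feature vectors."""
    dot = sum(a * b for a, b in zip(feat_a, feat_b))
    norm_a = math.sqrt(sum(a * a for a in feat_a))
    norm_b = math.sqrt(sum(b * b for b in feat_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def l1_depth_loss(pred, gt):
    """Mean absolute error between predicted and ground-truth depths."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def total_loss(pred_depth, gt_depth, feat_synth, feat_translated, weight=0.1):
    """Supervised depth loss plus a weighted latent consistency penalty.

    The additive form and the default weight are assumptions for this sketch.
    """
    depth_term = l1_depth_loss(pred_depth, gt_depth)
    latent_term = cosine_consistency(feat_synth, feat_translated)
    return depth_term + weight * latent_term


# Features pointing the same way incur no penalty; orthogonal ones incur 1.0.
aligned = cosine_consistency([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_consistency([1.0, 0.0], [0.0, 1.0])
loss = total_loss([10.0, 20.0], [12.0, 20.0], [1.0, 0.0], [0.0, 1.0], weight=0.5)
```

A term of this kind depends only on the depth network's internal features, which is consistent with the abstract's claim that the method is agnostic to how the translated images were produced.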

Taxonomy

Topics: Advanced Vision and Imaging · Medical Image Segmentation Techniques · Video Coding and Compression Technologies