On the Role of Depth in Surgical Vision Foundation Models: An Empirical Study of RGB-D Pre-training

John J. Han; Adam Schmidt; Muhammad Abdullah Jamal; Chinedu Nwoye; Anita Rau; Jie Ying Wu; Omid Mohareri

arXiv:2601.18929 · cs.CV · January 28, 2026

Open Access

TL;DR

This study shows empirically that incorporating depth information during pre-training significantly improves the performance and data efficiency of surgical vision models without altering the inference architecture.

Contribution

The paper provides the first large-scale empirical comparison of RGB versus RGB-D pre-training for surgical vision foundation models, highlighting the benefits of geometry-aware pre-training.

Findings

01. Depth-aware models outperform RGB-only models across tasks.

02. Geometric pre-training improves data efficiency: depth-aware models surpass full-data RGB models with only 25% of the data.

03. Depth is used only during pre-training, which simplifies practical adoption.

Abstract

Vision foundation models (VFMs) have emerged as powerful tools for surgical scene understanding. However, current approaches predominantly rely on unimodal RGB pre-training, overlooking the complex 3D geometry inherent to surgical environments. Although several architectures support multimodal or geometry-aware inputs in general computer vision, the benefits of incorporating depth information in surgical settings remain underexplored. We conduct a large-scale empirical study comparing eight ViT-based VFMs that differ in pre-training domain, learning objective, and input modality (RGB vs. RGB-D). For pre-training, we use a curated dataset of 1.4 million robotic surgical images paired with depth maps generated from an off-the-shelf network. We evaluate these models under both frozen-backbone and end-to-end fine-tuning protocols across eight surgical datasets spanning object detection,…

Taxonomy

Topics: Surgical Simulation and Training · Advanced Neural Network Applications · Soft Robotics and Applications