Surgical Scene Understanding in the Era of Foundation AI Models: A Comprehensive Review

Ufaq Khan, Umair Nawaz, Adnan Qayyum, Shazad Ashraf, Yutong Xie, Muhammad Haris Khan, Muhammad Bilal, and Junaid Qadir

arXiv:2502.14886 · cs.CV · November 4, 2025

TL;DR

This comprehensive review discusses how recent foundation AI models and deep learning techniques are transforming surgical scene understanding, highlighting advancements, challenges, and future directions for clinical integration.

Contribution

The paper surveys state-of-the-art ML and DL technologies, including foundation models, as applied to surgical scene analysis, and discusses their clinical implications.

Findings

01

Significant progress in segmentation, tracking, and phase recognition.

02

Challenges include data variability and computational demands.

03

Future research needed for seamless clinical integration.

Abstract

Recent advancements in machine learning (ML) and deep learning (DL), particularly through the introduction of Foundation Models (FMs), have significantly enhanced surgical scene understanding within minimally invasive surgery (MIS). This paper surveys the integration of state-of-the-art ML and DL technologies, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Foundation Models like the Segment Anything Model (SAM), into surgical workflows. These technologies improve segmentation accuracy, instrument tracking, and phase recognition in surgical scene understanding. The paper explores the challenges these technologies face, such as data variability and computational demands, and discusses ethical considerations and integration hurdles in clinical settings. Highlighting the roles of FMs, we bridge the technological capabilities with clinical needs and outline…
