TL;DR
Features from modern self-supervised vision foundation models (DINOv3, V-JEPA2) improve data-efficient surgical phase segmentation in manual small-incision cataract surgery (SICS) over supervised encoders (ResNet-50, I3D), especially in low-label settings. The study attributes the gain to transferable representations and lightweight adaptation.
Contribution
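The paper provides a controlled comparison on SICS-155 (19 phases) in which every visual encoder is paired with the same temporal model (MS-TCN++) under identical training and evaluation settings, so that differences in segmentation performance can be attributed to the visual representation. It shows foundation models outperforming supervised encoders in this setup and offers practical insights for low-label medical video applications.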
Findings
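DINOv3 ViT-7B achieves the best overall results on SICS-155: 83.4% accuracy and an 87.0 edit score.
Foundation-model features (DINOv3, V-JEPA2) improve segmentation performance over supervised encoders (ResNet-50, I3D) when paired with the same MS-TCN++ temporal model.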
Lightweight adaptation benefits transfer learning in surgical videos.
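The accuracy and edit score reported for this paper are standard metrics in temporal action segmentation. As a rough illustration only, the sketch below computes frame-wise accuracy and a segmental edit score (normalized Levenshtein distance over collapsed phase sequences) in plain Python. It assumes the definitions commonly used in the action segmentation literature; the paper's exact evaluation protocol is not stated here, and the function names are hypothetical.

```python
from typing import List, Sequence


def frame_accuracy(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Percentage of frames whose predicted phase equals the ground truth."""
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(pred, gt))
    return 100.0 * correct / len(gt)


def collapse_segments(labels: Sequence[int]) -> List[int]:
    """Collapse consecutive repeats: [0, 0, 1, 1, 2] -> [0, 1, 2]."""
    segments: List[int] = []
    for label in labels:
        if not segments or segments[-1] != label:
            segments.append(label)
    return segments


def edit_score(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Segmental edit score in [0, 100]; assumed definition, see lead-in."""
    p, g = collapse_segments(pred), collapse_segments(gt)
    m, n = len(p), len(g)
    if max(m, n) == 0:
        return 100.0
    # Standard Levenshtein dynamic program over segment labels.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return 100.0 * (1.0 - dist[m][n] / max(m, n))
```

Frame accuracy rewards per-frame correctness, while the edit score penalizes over-segmentation (spurious phase switches) even when most frames are labeled correctly, which is why the two numbers are usually reported together.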
Abstract
Surgical phase segmentation is central to computer-assisted surgery, yet robust models remain difficult to develop when labeled surgical videos are scarce. We study data-efficient phase segmentation for manual small-incision cataract surgery (SICS) through a controlled comparison of visual representations. To isolate representation quality, we pair each visual encoder with the same temporal model (MS-TCN++) under identical training and evaluation settings on SICS-155 (19 phases). We compare supervised encoders (ResNet-50, I3D) against large self-supervised foundation models (DINOv3, V-JEPA2), and use a cached-feature pipeline that decouples expensive visual encoding from lightweight temporal learning. Foundation-model features improve segmentation performance in this setup, with DINOv3 ViT-7B achieving the best overall results (83.4% accuracy, 87.0 edit score). We further examine…
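The cached-feature idea mentioned in the abstract can be sketched as follows: run the frozen visual encoder once per video, store the per-frame features on disk, and let the temporal model read from that cache on every later pass. This is a minimal, dependency-free sketch under stated assumptions, not the authors' implementation. `toy_encoder`, `FeatureCache`, and the pickle-based storage are hypothetical stand-ins; a real pipeline would use an actual encoder (for example a ViT) and an array format.

```python
import pickle
from pathlib import Path
from typing import Callable, List

Frame = List[float]              # hypothetical stand-in for one video frame
FeatureSeq = List[List[float]]   # one feature vector per frame


def toy_encoder(frame: Frame) -> List[float]:
    """Hypothetical placeholder for an expensive frozen visual encoder."""
    return [sum(frame), max(frame)]


class FeatureCache:
    """Encode each video at most once; later reads come from disk."""

    def __init__(self, root: Path, encoder: Callable[[Frame], List[float]]):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.encoder = encoder
        self.encoder_calls = 0  # counts how often the encoder actually runs

    def get(self, video_id: str, frames: List[Frame]) -> FeatureSeq:
        path = self.root / f"{video_id}.pkl"
        if path.exists():
            # Cache hit: skip the visual encoder entirely.
            with path.open("rb") as f:
                return pickle.load(f)
        feats: FeatureSeq = []
        for frame in frames:
            feats.append(self.encoder(frame))
            self.encoder_calls += 1
        with path.open("wb") as f:
            pickle.dump(feats, f)
        return feats


if __name__ == "__main__":
    import tempfile

    cache = FeatureCache(Path(tempfile.mkdtemp()), toy_encoder)
    video = [[1.0, 2.0], [3.0, 4.0]]
    for epoch in range(3):
        # A temporal model (MS-TCN++ in the paper) would train on `feats` here.
        feats = cache.get("video_001", video)
    print("encoder calls:", cache.encoder_calls)
```

Because the encoder is frozen, its outputs do not change between epochs, so repeated temporal-model training runs pay the encoding cost only once per video.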
