Benchmarking CNN- and Transformer-Based Models for Surgical Instrument Segmentation in Robotic-Assisted Surgery
Sara Ameli

TL;DR
This paper benchmarks CNN and Transformer models for surgical instrument segmentation in robotic surgery, comparing their performance on a real-world dataset to guide model selection in surgical AI.
Contribution
It provides a comprehensive comparison of five deep learning architectures, highlighting the strengths of transformer-based models in complex surgical scenes.
Findings
DeepLabV3 achieves performance comparable to SegFormer.
Transformer architectures improve global context understanding.
Convolutional models like UNet perform strongly as baselines.
Abstract
Accurate segmentation of surgical instruments in robotic-assisted surgery is critical for enabling context-aware computer-assisted interventions, such as tool tracking, workflow analysis, and autonomous decision-making. In this study, we benchmark five deep learning architectures (UNet, UNet++, DeepLabV3, Attention UNet, and SegFormer) on the SAR-RARP50 dataset for multi-class semantic segmentation of surgical instruments in real-world radical prostatectomy videos. The models are trained with a compound loss function combining cross-entropy and Dice loss to address class imbalance and capture fine object boundaries. Our experiments reveal that while convolutional models such as UNet and Attention UNet provide strong baseline performance, DeepLabV3 achieves results comparable to SegFormer, demonstrating the effectiveness of atrous convolution and multi-scale context aggregation in capturing…
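The compound loss mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the equal weights (`ce_weight`, `dice_weight`), the smoothing constant `eps`, the function names, and the choice to average a soft Dice score over classes are all assumptions made for illustration, since the summary does not state how the two terms are weighted or reduced.

```python
import numpy as np


def softmax(logits, axis=1):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def compound_loss(logits, target, ce_weight=0.5, dice_weight=0.5, eps=1e-6):
    """Weighted sum of cross-entropy and soft Dice loss (illustrative only).

    logits: float array of shape (N, C, H, W), raw per-class scores.
    target: int array of shape (N, H, W), class index per pixel.
    The weights and eps are assumed values, not taken from the paper.
    """
    num_classes = logits.shape[1]
    probs = softmax(logits, axis=1)

    # One-hot encode the labels and move the class axis to position 1.
    one_hot = np.moveaxis(np.eye(num_classes)[target], -1, 1)

    # Cross-entropy: mean negative log-probability of the true class.
    log_probs = np.log(np.clip(probs, eps, 1.0))
    ce = -np.mean(np.sum(one_hot * log_probs, axis=1))

    # Soft Dice per class, accumulated over batch and spatial dimensions.
    intersection = np.sum(probs * one_hot, axis=(0, 2, 3))
    denom = np.sum(probs, axis=(0, 2, 3)) + np.sum(one_hot, axis=(0, 2, 3))
    dice_per_class = (2.0 * intersection + eps) / (denom + eps)
    dice_loss = 1.0 - dice_per_class.mean()

    return ce_weight * ce + dice_weight * dice_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.integers(0, 3, size=(2, 4, 4))
    confident = 20.0 * np.moveaxis(np.eye(3)[target], -1, 1)
    uniform = np.zeros((2, 3, 4, 4))
    print("near-perfect prediction:", compound_loss(confident, target))
    print("uniform prediction:", compound_loss(uniform, target))
```

Averaging Dice per class gives rare instrument classes the same weight as frequent ones, which is one common way such a term counteracts class imbalance; the cross-entropy term keeps the per-pixel gradient signal.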
