EndoARSS: Adapting Spatially-Aware Foundation Model for Efficient Activity Recognition and Semantic Segmentation in Endoscopic Surgery

Guankun Wang, Rui Tang, Mengya Xu, Long Bai, Huxin Gao, and Hongliang Ren

arXiv:2506.06830·cs.CV·June 10, 2025

Open Access

TL;DR

EndoARSS is a multi-task learning framework built on DINOv2, designed to improve activity recognition and semantic segmentation in endoscopic surgery by leveraging spatially-aware multi-scale attention and efficient fine-tuning techniques.

Contribution

The paper introduces EndoARSS, a novel multi-task learning approach that combines Low-Rank Adaptation and spatially-aware multi-scale attention for enhanced surgical scene understanding.
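
For readers unfamiliar with Low-Rank Adaptation, the following is a minimal sketch of the general technique, not the paper's implementation. It assumes a single frozen linear layer with weight W, to which a trainable low-rank update (alpha / r) * B * A is added. The class name `LoRALinear`, the helper `matvec`, and the `alpha` and `seed` parameters are hypothetical names chosen for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # pretrained weight, kept frozen
        self.scale = alpha / rank     # common LoRA scaling factor
        # A starts with small random values, B starts at zero, so the
        # adapted layer initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A and B are trained: rank * (d_in + d_out) values.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

Only A and B are updated during fine-tuning, so the number of trainable values is r * (d_in + d_out) rather than d_in * d_out. For the large projection matrices of a vision transformer and a small rank r, this is a small fraction of the full weight count.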

Findings

01. EndoARSS significantly outperforms existing models in accuracy and robustness.

02. The paper introduces three new datasets for endoscopic surgery analysis.

03. The framework demonstrates effective multi-task learning in complex surgical environments.

Abstract

Endoscopic surgery is the gold standard for robotic-assisted minimally invasive surgery, offering significant advantages in early disease detection and precise interventions. However, surgical scenes are complex: they vary widely across surgical activity scenarios, and the image features of targets are easily confused with those of the background, which makes surgical environment understanding challenging. Traditional deep learning models often struggle with cross-activity interference, leading to suboptimal performance on each downstream task. To address this limitation, we explore multi-task learning, which exploits the interrelated features between tasks to enhance overall task performance. In this paper, we propose EndoARSS, a novel multi-task learning framework specifically designed for endoscopic surgery activity recognition and semantic segmentation. Built upon the DINOv2…
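
The multi-task idea described in the abstract, one shared set of features serving both activity recognition and semantic segmentation, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' method: it assumes per-patch feature vectors from a shared encoder, mean pooling for the image-level activity head, plain linear heads, and a weighted sum of two cross-entropy losses. The function names and the `seg_weight` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-likelihood of the target class index."""
    return -math.log(softmax(logits)[target])


def linear(weights, features):
    """Apply a bias-free linear head: one output logit per weight row."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def multitask_loss(patch_features, cls_weights, seg_weights,
                   activity_label, patch_labels, seg_weight=1.0):
    """Combine an image-level and a patch-level loss on shared features."""
    n = len(patch_features)
    dim = len(patch_features[0])
    # Activity recognition: pool the patch features into one image vector.
    pooled = [sum(p[i] for p in patch_features) / n for i in range(dim)]
    cls_loss = cross_entropy(linear(cls_weights, pooled), activity_label)
    # Segmentation: classify every patch from the same shared features.
    seg_loss = sum(
        cross_entropy(linear(seg_weights, p), y)
        for p, y in zip(patch_features, patch_labels)
    ) / n
    return cls_loss + seg_weight * seg_loss, cls_loss, seg_loss
```

Because both losses are computed from the same features, gradients from either task would update the shared encoder, which is the mechanism by which one task can help or interfere with the other.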

Taxonomy

Topics: Surgical Simulation and Training · Multimodal Machine Learning Applications · Soft Robotics and Applications