TL;DR
UniSurgSAM is a unified promptable surgical video segmentation model that accepts visual, textual, or audio prompts and targets two failure modes of existing methods: hallucinated predictions when the target is absent, and accumulated mask drift over long sequences.
Contribution
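It introduces a decoupled two-stage framework, motivated by the optimization interference between target initialization and tracking in coupled designs, together with reliability-oriented components, and reports real-time, multi-modal surgical video segmentation with state-of-the-art performance.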
Findings
Achieves state-of-the-art accuracy across all prompt modalities.
Effectively suppresses hallucinations during target absence.
Prevents mask drift in long surgical sequences.
Abstract
Surgical video segmentation is fundamental to computer-assisted surgery. In practice, surgeons need to dynamically specify targets throughout extended procedures, using heterogeneous cues such as visual selections, textual expressions, or audio instructions. However, existing Promptable Video Object Segmentation (PVOS) methods are typically restricted to a single prompt modality and rely on coupled frameworks that cause optimization interference between target initialization and tracking. Moreover, these methods produce hallucinated predictions when the target is absent and suffer from accumulated mask drift without failure recovery. To address these challenges, we present UniSurgSAM, a unified PVOS model enabling reliable surgical video segmentation through visual, textual, or audio prompts. Specifically, UniSurgSAM employs a decoupled two-stage framework that independently optimizes…
