Grounding Surgical Action Triplets with Instrument Instance Segmentation: A Dataset and Target-Aware Fusion Approach

Oluwatosin Alabi; Meng Wei; Charlie Budd; Tom Vercauteren; Miaojing Shi

arXiv:2511.00643·cs.CV·April 9, 2026

TL;DR

This paper introduces a new dataset and a novel neural network architecture for spatially grounding surgical instrument actions and targets, significantly improving surgical scene understanding.

Contribution

It presents CholecTriplet-Seg, a large-scale dataset, and TargetFusionNet, a new model that enhances triplet grounding accuracy with target-aware fusion.

Findings

01

TargetFusionNet outperforms existing baselines in recognition and segmentation metrics.

02

The dataset links instrument masks with action and target annotations over 30,000 frames.

03

Strong instance supervision and weak target priors improve surgical action understanding.
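
As a rough illustration of the second finding, one record linking an instrument instance mask to its action and target labels might be sketched as below. The class name `GroundedTriplet`, its field names, the toy mask, and the example labels are all assumptions made for illustration; this is not the dataset's actual schema or file format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroundedTriplet:
    """Hypothetical record: one instrument instance with its verb and target."""
    instrument: str          # instrument class of this instance
    verb: str                # action performed by this instance
    target: str              # anatomical target of the action
    mask: List[List[int]]    # binary instance mask (rows x cols), 1 = instrument pixel

    def area(self) -> int:
        # Number of pixels covered by the instance mask.
        return sum(sum(row) for row in self.mask)

    def as_tuple(self) -> Tuple[str, str, str]:
        # The <instrument, verb, target> triplet without the spatial grounding.
        return (self.instrument, self.verb, self.target)


# Toy 2x3 mask with illustrative labels (not taken from the dataset).
example = GroundedTriplet(
    instrument="grasper",
    verb="retract",
    target="gallbladder",
    mask=[[0, 1, 1],
          [0, 1, 0]],
)
```

Keeping the mask and the triplet labels in one record is what distinguishes this from frame-level classification: each label is attached to a specific instance rather than to the whole frame.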

Abstract

Understanding surgical instrument-tissue interactions requires not only identifying which instrument performs which action on which anatomical target, but also grounding these interactions spatially within the surgical scene. Existing surgical action triplet recognition methods are limited to learning from frame-level classification, failing to reliably link actions to specific instrument instances. Previous attempts at spatial grounding have primarily relied on class activation maps, which lack the precision and robustness required for detailed instrument-tissue interaction analysis. To address this gap, we propose grounding surgical action triplets with instrument instance segmentation, or triplet segmentation for short, a new unified task which produces spatially grounded <instrument, verb, target> outputs. We start by presenting CholecTriplet-Seg, a large-scale dataset containing over…
