ROBUST-MIPS: A Combined Skeletal Pose and Instance Segmentation Dataset for Laparoscopic Surgical Instruments

Zhe Han; Charlie Budd; Gongyu Zhang; Huanyu Tian; Christos Bergeles; Tom Vercauteren

arXiv:2508.21096·cs.CV·March 19, 2026

TL;DR

This paper introduces ROBUST-MIPS, a new dataset combining skeletal pose and instance segmentation annotations for laparoscopic surgical instruments, enabling improved tool localization and facilitating research in surgical tool analysis.

Contribution

The paper presents a novel combined dataset with pose and segmentation annotations, along with benchmark models and software to promote adoption and advance surgical tool localization research.

Findings

01. Pose annotations enable effective surgical tool localization.

02. Popular pose estimation methods achieve high-quality results on the dataset.

03. The dataset and tools support joint study of pose and segmentation.

Abstract

Localisation of surgical tools constitutes a foundational building block for computer-assisted interventional technologies. Works in this field typically focus on training deep learning models to perform segmentation tasks. Performance of learning-based approaches is limited by the availability of diverse annotated data. We argue that skeletal pose annotations are a more efficient annotation approach for surgical tools, striking a balance between richness of semantic information and ease of annotation, thus allowing for accelerated growth of available annotated data. To encourage adoption of this annotation style, we present ROBUST-MIPS, a combined tool pose and tool instance segmentation dataset derived from the existing ROBUST-MIS dataset. Our enriched dataset facilitates the joint study of these two annotation styles and allows head-to-head comparison on various downstream tasks. To…

Tables

Table 1: Overview of the skeletal representation of articulated and rigid surgical tools in different states, with the visualisation of each case shown in Figure 2 and Figure 3.

Tool types	States	Tool representation	Cases
Articulated	all keypoints visible	4 points and 3 lines	Figure 2(a)
	Tips or HingePoint occluded	4 points and 3 lines	Figure 3(d,e,f)
	one tip missing/closed	3 points and 2 lines	Figure 3(a,b)
	only shaft in the FoV	2 points and 1 line	Figure 3(c)
Rigid	all keypoints visible	3 points and 2 lines	Figure 2(b)
	only shaft in the FoV	2 points and 1 line	similar to Figure 3(c)
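
Table 1 can be read as a small lookup from tool type and state to skeleton size. The sketch below copies the point and line counts from the table; the dictionary name, the lowercase state labels, and the helper function are illustrative choices, not part of the dataset's software. It also checks a property every row appears to share: each skeleton has exactly one fewer line than points, as expected for a connected skeleton without loops.

```python
# (tool type, state) -> (number of keypoints, number of line segments),
# copied from Table 1. Names and labels here are illustrative only.
SKELETON_SIZES = {
    ("articulated", "all keypoints visible"): (4, 3),
    ("articulated", "tips or hinge point occluded"): (4, 3),
    ("articulated", "one tip missing/closed"): (3, 2),
    ("articulated", "only shaft in the FoV"): (2, 1),
    ("rigid", "all keypoints visible"): (3, 2),
    ("rigid", "only shaft in the FoV"): (2, 1),
}


def is_loop_free_skeleton(num_points: int, num_lines: int) -> bool:
    """A connected skeleton without loops has one fewer line than points."""
    return num_lines == num_points - 1


for (tool, state), (points, lines) in SKELETON_SIZES.items():
    ok = is_loop_free_skeleton(points, lines)
    print(f"{tool:12s} {state:30s} {points} points, {lines} lines, loop-free: {ok}")
```

Note that, according to the table, occluded keypoints keep the full 4-point representation, whereas a missing or closed tip reduces the skeleton to 3 points.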

Table 2: Case distribution of the ROBUST-MIS dataset, with frames per stage and surgery. Empty frames (denoted ef. in the table) are given as the percentage of frames in which no instrument appears.

Procedure	Training	Testing (Stage 1)	Testing (Stage 2)	Testing (Stage 3)
Proctocolectomy	2,943 (2% ef.)	325 (11% ef.)	225 (11% ef.)	0
Rectal resection	3,040 (20% ef.)	338 (20% ef.)	289 (15% ef.)	0
Sigmoid resection	0	0	0	2,880 (23% ef.)
Total	5,983 (17% ef.)	663 (15% ef.)	514 (13% ef.)	2,880 (23% ef.)

Table 3: Case distribution of the ROBUST-MIPS dataset, with frames per stage and surgery. The training and validation data come from the same group of patients undergoing two types of surgeries, while the testing set includes data from different patients undergoing the same two surgery types, as well as a third surgery type not present in the training process.

Procedure	Training	Validation	Testing
Proctocolectomy	2,943	325	225
Rectal resection	3,040	338	289
Sigmoid resection	0	0	2,880
Total	5,983	663	3,394
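
The frame counts in Tables 2 and 3 can be cross-checked with a few lines of arithmetic. The sketch below assumes, based only on the matching counts, that the ROBUST-MIS Stage 1 frames form the ROBUST-MIPS validation set and that Stages 2 and 3 together form the testing set; this mapping is an inference from the numbers, and the variable and function names are hypothetical.

```python
# Frame counts per procedure from Table 2 (ROBUST-MIS).
ROBUST_MIS = {
    "Proctocolectomy": {"training": 2943, "stage1": 325, "stage2": 225, "stage3": 0},
    "Rectal resection": {"training": 3040, "stage1": 338, "stage2": 289, "stage3": 0},
    "Sigmoid resection": {"training": 0, "stage1": 0, "stage2": 0, "stage3": 2880},
}


def to_mips_split(row: dict) -> dict:
    """Assumed mapping: Stage 1 -> validation, Stages 2 + 3 -> testing."""
    return {
        "training": row["training"],
        "validation": row["stage1"],
        "testing": row["stage2"] + row["stage3"],
    }


mips = {name: to_mips_split(row) for name, row in ROBUST_MIS.items()}
totals = {
    split: sum(row[split] for row in mips.values())
    for split in ("training", "validation", "testing")
}
print(totals)  # {'training': 5983, 'validation': 663, 'testing': 3394}
```

Under that assumption, the per-procedure rows and the totals reproduce Table 3 exactly.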

Table 4: Parameters of the models.

optimiser	AdamW
base learning rate	0.0005
learning rate schedule	LinearLR
batch size	32 (train)
	16 (val)
warm-up iterations	500
weight decay	0.01
training epochs	600
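
For orientation, the settings in Table 4 can be written as a plain configuration dictionary with a linear warm-up rule. This is a minimal sketch, not the authors' training configuration: the dictionary layout, the function name `warmup_lr`, and in particular the warm-up `start_factor` are assumptions, since the table gives only the schedule type and the number of warm-up iterations. The iterations-per-epoch figure assumes the 5,983 training frames from Table 3 and that the final partial batch is kept.

```python
import math

# Values taken from Table 4; the key names are illustrative.
TRAIN_CFG = {
    "optimiser": "AdamW",
    "base_lr": 5e-4,
    "weight_decay": 0.01,
    "warmup_iters": 500,
    "batch_size": {"train": 32, "val": 16},
    "epochs": 600,
}


def warmup_lr(iteration: int, cfg: dict = TRAIN_CFG, start_factor: float = 0.001) -> float:
    """Linearly ramp the learning rate to base_lr over the warm-up iterations.

    start_factor is an assumed value; Table 4 does not specify it.
    """
    if iteration >= cfg["warmup_iters"]:
        return cfg["base_lr"]
    progress = iteration / cfg["warmup_iters"]
    factor = start_factor + (1.0 - start_factor) * progress
    return cfg["base_lr"] * factor


# Assumes 5,983 training frames and that the last partial batch is kept.
iters_per_epoch = math.ceil(5983 / TRAIN_CFG["batch_size"]["train"])
print(iters_per_epoch)  # 187
print(warmup_lr(0), warmup_lr(250), warmup_lr(500))
```

With these assumptions, the 500 warm-up iterations cover fewer than three of the 600 training epochs.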

Table 5: Results of various algorithms for surgical tool pose estimation on the ROBUST-MIPS testing set. SBL stands for SimpleBaseLine [26].

Model	Backbone	Resolution	AP	AP$_{OKS = 0.5}$	AP$_{OKS = 0.75}$	AR	AR$_{OKS = 0.5}$	AR$_{OKS = 0.75}$
SBL	ResNet152	256x192	0.694	0.819	0.704	0.732	0.834	0.739
SBL	ResNet152	384x288	0.684	0.807	0.694	0.730	0.830	0.740
RTMPose	CSPNext-m	256x192	0.705	0.820	0.716	0.740	0.839	0.748
RTMPose	CSPNext-l	256x192	0.712	0.827	0.722	0.750	0.845	0.758
ViTPose-B	ViT-B	256x192	0.735	0.832	0.750	0.768	0.847	0.778
ViTPose-L	ViT-L	256x192	0.754	0.842	0.771	0.784	0.855	0.796

Equations

$$\mathrm{OKS} = \frac{\sum_{i} \exp\left(-\frac{d_{i}^{2}}{2 s^{2} \kappa_{i}^{2}}\right) \delta(v_{i} > 0)}{\sum_{i} \delta(v_{i} > 0)}$$

$$s = \frac{w^{2} + h^{2}}{2}$$
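
The OKS formula above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the evaluation code used for Table 5: the function name and argument layout are hypothetical, the scale `s` is passed in directly rather than derived from the box width and height, and returning 0.0 when no keypoint is labelled is a choice made here for convenience.

```python
import math


def oks(pred, gt, vis, s, kappas):
    """Object keypoint similarity for one instance.

    pred, gt : sequences of (x, y) keypoint coordinates
    vis      : visibility flags v_i; only keypoints with v_i > 0 count
    s        : object scale, as defined alongside the formula
    kappas   : per-keypoint constants kappa_i
    """
    numerator = 0.0
    labelled = 0
    for (px, py), (gx, gy), v, kappa in zip(pred, gt, vis, kappas):
        if v > 0:
            d2 = (px - gx) ** 2 + (py - gy) ** 2  # squared distance d_i^2
            numerator += math.exp(-d2 / (2 * s ** 2 * kappa ** 2))
            labelled += 1
    # Assumption: an instance with no labelled keypoints scores 0.0.
    return numerator / labelled if labelled else 0.0


gt = [(0.0, 0.0), (5.0, 5.0), (9.0, 9.0)]
pred = [(1.0, 1.0), (5.0, 5.0), (50.0, 50.0)]
vis = [2, 2, 0]  # the third keypoint is unlabelled and is ignored
print(oks(pred, gt, vis, s=10.0, kappas=[0.1, 0.1, 0.1]))
```

In this example the first keypoint is off by a squared distance of 2, which equals 2 s² κ² for the chosen values, so it contributes exp(-1); the second is exact and contributes 1, giving an OKS of (1 + exp(-1)) / 2.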

Full text

ROBUST-MIPS: A Combined Skeletal Pose and Instance Segmentation Dataset for Laparoscopic Surgical Instruments

Zhe Han