Do segmentation metrics reflect clinical reality? A surgeon-centered evaluation in robot-assisted minimally invasive esophagectomy

Ronald de Jong; Yiping Li; Romy van Jaarsveld; Gino Kuiper; Richard van Hillegersberg; Jelle Ruurda; Josien Pluim; Marcel Breeuwer; Yasmina Al Khalil

PMC · DOI: 10.1007/s00464-025-12266-3 · October 10, 2025

Open Access

TL;DR

This study evaluates how well AI segmentation metrics align with surgeon assessments in robot-assisted esophagectomy procedures.

Contribution

The study introduces surgeon-centered evaluation of segmentation metrics in RAMIE, revealing gaps between quantitative metrics and clinical relevance.

Findings

01

Overlap and temporal metrics best align with surgeon assessments of anatomical overlays.

02

Novice surgeons show weaker correlations and tend to rate overlays more leniently.

03

Qualitative feedback highlights hallucinations and instability missed by current metrics.

Abstract

Deep learning-based anatomy segmentation holds promise for improving real-time guidance in complex surgeries such as robot-assisted minimally invasive esophagectomy (RAMIE). However, the clinical relevance of commonly used metrics for evaluating segmentation quality remains unclear, as previous assessments have lacked direct input from surgeons. This study aims to assess how well quantitative segmentation metrics reflect surgeons’ assessments of anatomical overlay accuracy and clinical usefulness during RAMIE. We conducted a survey involving 26 upper gastrointestinal surgeons, including both trainee and attending surgeons, who assessed video clips of RAMIE procedures featuring deep learning-generated anatomical overlays. We correlated the surgeons’ qualitative evaluations of annotation accuracy and clinical usefulness with a comprehensive set of quantitative metrics, including overlap,…
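The comparison described in the abstract, relating surgeon ratings to quantitative metrics, can be illustrated with a minimal sketch. It assumes the Dice coefficient as the overlap metric and Spearman rank correlation as the measure of agreement. Neither choice is confirmed by the text shown here, and the helper names (`dice`, `ranks`, `spearman`) and the example numbers are hypothetical, not data from the study.

```python
from statistics import mean


def dice(pred, truth):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    inter = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks are treated as perfect agreement (a convention, not a rule).
    return 1.0 if total == 0 else 2.0 * inter / total


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-clip values, for illustration only.
clip_dice = [0.91, 0.62, 0.78, 0.45, 0.85]  # metric score per video clip
surgeon_rating = [5, 3, 4, 2, 4]            # e.g. a 1-5 rating per clip
rho = spearman(clip_dice, surgeon_rating)
```

A rank-based correlation is used in this sketch because surgeon ratings are ordinal; a different agreement measure would slot into the same structure.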

Taxonomy

Topics: Radiomics and Machine Learning in Medical Imaging · Artificial Intelligence in Healthcare and Education · Advanced X-ray and CT Imaging