Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?

Pedro R. A. S. Bassi; Wenxuan Li; Yucheng Tang; Fabian Isensee; Zifu Wang; Jieneng Chen; Yu-Cheng Chou; Yannick Kirchhoff; Maximilian Rokuss; Ziyan Huang; Jin Ye; Junjun He; Tassilo Wald; Constantin Ulrich; Michael Baumgartner; Saikat Roy; Klaus H. Maier-Hein; Paul Jaeger; Yiwen Ye; Yutong Xie; Jianpeng Zhang; Ziyang Chen; Yong Xia; Zhaohu Xing; Lei Zhu; Yousef Sadegheih; Afshin Bozorgpour; Pratibha Kumari; Reza Azad; Dorit Merhof; Pengcheng Shi; Ting Ma; Yuxin Du; Fan Bai; Tiejun Huang; Bo Zhao; Haonan Wang; Xiaomeng Li; Hanxue Gu; Haoyu Dong; Jichen Yang; Maciej A. Mazurowski; Saumya Gupta; Linshan Wu; Jiaxin Zhuang; Hao Chen; Holger Roth; Daguang Xu; Matthew B. Blaschko; Sergio Decherchi; Andrea Cavalli; Alan L. Yuille; Zongwei Zhou

arXiv:2411.03670·cs.CV·January 22, 2025·2 cites

TL;DR

The Touchstone Benchmark provides a large, diverse, and rigorous evaluation platform for AI algorithms in medical segmentation, addressing limitations of existing benchmarks and promoting real-world applicability.

Contribution

It introduces a large-scale, multi-hospital dataset and a third-party evaluation process to improve the assessment of AI algorithms for medical segmentation.

Findings

01. Enhanced statistical significance of results

02. Evaluation across out-of-distribution scenarios

03. Comparison of multiple AI frameworks and algorithms

Abstract

How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party,…
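The abstract's point that a larger test set enhances statistical significance can be illustrated with a minimal sketch. It assumes a per-scan score between 0 and 1 (for example a Dice score; the visible abstract does not name the metric) and uses synthetic scores, not Touchstone results. The function names `bootstrap_ci` and `synthetic_scores`, the score distribution, and the 50-scan comparison size are all hypothetical choices made for illustration; only the 5,903 figure comes from the abstract.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the test set with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def synthetic_scores(n, seed):
    """Synthetic per-scan scores clipped to [0, 1]; NOT benchmark data."""
    rng = random.Random(seed)
    return [min(1.0, max(0.0, rng.gauss(0.85, 0.10))) for _ in range(n)]


# Hypothetical small test set vs. one the size of the Touchstone test set.
small_lo, small_hi = bootstrap_ci(synthetic_scores(50, seed=1))
large_lo, large_hi = bootstrap_ci(synthetic_scores(5903, seed=2))

small_width = small_hi - small_lo
large_width = large_hi - large_lo
print(f"95% CI width with 50 scans:   {small_width:.4f}")
print(f"95% CI width with 5903 scans: {large_width:.4f}")
```

Under these assumptions the interval for the larger set is much narrower, so smaller differences between two algorithms can be distinguished from sampling noise. The sketch says nothing about out-of-distribution effects, which depend on where the scans come from rather than how many there are.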

Code & Models

Repositories

mrgiovanni/touchstone (official)

Taxonomy

Topics: Medical Imaging and Analysis · Artificial Intelligence in Healthcare and Education

Methods: Sparse Evolutionary Training