Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?
Pedro R. A. S. Bassi, Wenxuan Li, Yucheng Tang, Fabian Isensee, Zifu Wang, Jieneng Chen, Yu-Cheng Chou, Yannick Kirchhoff, Maximilian Rokuss, Ziyan Huang, Jin Ye, Junjun He, Tassilo Wald, Constantin Ulrich, Michael Baumgartner, Saikat Roy, Klaus H. Maier-Hein, Paul Jaeger

TL;DR
The Touchstone Benchmark provides a large, diverse, and rigorous evaluation platform for AI algorithms in medical segmentation, addressing limitations of existing benchmarks and promoting real-world applicability.
Contribution
It introduces a large-scale, multi-hospital dataset and a third-party evaluation process to improve the assessment of AI algorithms for medical segmentation.
Findings
Enhanced statistical significance of results
Evaluation across out-of-distribution scenarios
Comparison of multiple AI frameworks and algorithms
Abstract
How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party,…
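The abstract's claim that a larger test set enhances statistical significance can be illustrated with a small sketch. It compares the width of a normal-approximation 95% confidence interval for a mean per-scan score on a hypothetical 100-scan test set against one with 5,903 scans. The scores are synthetic, and the helper names (`ci_half_width`, `synthetic_scores`) and the score distribution are assumptions made for illustration; this is not the statistical procedure used by the benchmark.

```python
import math
import random
import statistics


def ci_half_width(scores, z=1.96):
    """Half-width of a normal-approximation 95% CI for the mean score."""
    return z * statistics.stdev(scores) / math.sqrt(len(scores))


def synthetic_scores(n, rng):
    # Synthetic per-scan scores in [0, 1]; NOT data from the benchmark.
    return [rng.betavariate(8, 2) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
small = synthetic_scores(100, rng)   # hypothetical small test set
large = synthetic_scores(5903, rng)  # same size as the Touchstone test set

hw_small = ci_half_width(small)
hw_large = ci_half_width(large)

print(f"n={len(small)}: mean={statistics.mean(small):.3f} +/- {hw_small:.3f}")
print(f"n={len(large)}: mean={statistics.mean(large):.3f} +/- {hw_large:.3f}")
```

Because the interval half-width shrinks roughly with the square root of the number of scans, the larger test set can separate algorithms whose mean scores differ by a margin that the smaller set could not resolve.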
Taxonomy
Topics: Medical Imaging and Analysis · Artificial Intelligence in Healthcare and Education
Methods: Sparse Evolutionary Training
