A-Eval: A Benchmark for Cross-Dataset Evaluation of Abdominal Multi-Organ Segmentation

Ziyan Huang; Zhongying Deng; Jin Ye; Haoyu Wang; Yanzhou Su; Tianbin Li; Hui Sun; Junlong Cheng; Jianpin Chen; Junjun He; Yun Gu; Shaoting Zhang; Lixu Gu; Yu Qiao

arXiv:2309.03906·eess.IV·February 20, 2025·6 cites

TL;DR

This paper introduces A-Eval, a comprehensive benchmark for assessing the cross-dataset generalization of abdominal multi-organ segmentation models, analyzing various training strategies and model sizes across multiple large-scale datasets.

Contribution

It presents the A-Eval benchmark, enabling systematic evaluation of model generalization across diverse datasets and training scenarios in abdominal multi-organ segmentation.

Findings

01. Models trained on large datasets show improved generalization.
02. Data usage strategies significantly impact model performance.
03. Larger models tend to generalize better across datasets.

Abstract

Although deep learning has revolutionized abdominal multi-organ segmentation, models often struggle to generalize because they are trained on small, specific datasets. With the recent emergence of large-scale datasets, some important questions arise: Can models trained on these datasets generalize well to different ones? Whether they can or not, how can their generalizability be further improved? To address these questions, we introduce A-Eval, a benchmark for the cross-dataset Evaluation ('Eval') of Abdominal ('A') multi-organ segmentation. We employ training sets from four large-scale public datasets: FLARE22, AMOS, WORD, and TotalSegmentator, each providing extensive labels for abdominal multi-organ segmentation. For evaluation, we incorporate the validation sets from these datasets along with the training set from the BTCV dataset, forming a robust benchmark comprising five distinct datasets.…
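The train/evaluate layout described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: only the dataset names come from the abstract, while the function names `dice` and `evaluation_pairs` are hypothetical, and the choice of the Dice coefficient as the overlap score is an assumption, since the visible part of the abstract does not name a metric.

```python
from itertools import product

# Dataset names come from the abstract; everything else here is illustrative.
TRAIN_SETS = ["FLARE22", "AMOS", "WORD", "TotalSegmentator"]
# Evaluation uses the validation sets of the four datasets above,
# plus the BTCV training set, which is never used for training.
EVAL_SETS = TRAIN_SETS + ["BTCV"]


def dice(pred, target, label):
    """Dice overlap for one organ label over two flat label sequences.

    Assumed metric: the visible abstract does not state which score is used.
    """
    pred_hits = [p == label for p in pred]
    target_hits = [t == label for t in target]
    intersection = sum(p and t for p, t in zip(pred_hits, target_hits))
    total = sum(pred_hits) + sum(target_hits)
    # Convention chosen here: an organ absent from both masks scores 1.0.
    return 1.0 if total == 0 else 2.0 * intersection / total


def evaluation_pairs():
    """Every (training dataset, evaluation dataset) pair, tagged by kind."""
    return [
        (train, test, "in-dataset" if train == test else "cross-dataset")
        for train, test in product(TRAIN_SETS, EVAL_SETS)
    ]


pairs = evaluation_pairs()
cross = [p for p in pairs if p[2] == "cross-dataset"]
print(len(pairs), len(cross))  # 20 pairs in total, 16 of them cross-dataset
```

With one model per training dataset, the cross-dataset pairs are the ones that measure generalization; the in-dataset pairs serve as a reference point for how much performance drops when the evaluation data comes from elsewhere.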

Taxonomy

Topics: Colorectal Cancer Screening and Detection · Artificial Intelligence in Healthcare and Education · Autopsy Techniques and Outcomes
