PANDA-PLUS-Bench: A Clinical Benchmark for Evaluating Robustness of AI Foundation Models in Prostate Cancer Diagnosis
Joshua L. Ebbert, Dennis Della Corte

TL;DR
PANDA-PLUS-Bench is a new curated dataset and benchmark designed to evaluate the robustness of AI models in prostate cancer diagnosis, specifically assessing their ability to distinguish biological signals from slide-specific artifacts.
Contribution
The paper introduces PANDA-PLUS-Bench, a specialized benchmark dataset for evaluating foundation models' robustness in prostate cancer Gleason grading, highlighting variability in model performance and emphasizing the importance of tissue-specific training.
Findings
HistoEncoder achieved the highest cross-slide accuracy (59.7%)
Large models showed significant slide-level encoding variability
All models exhibited within-slide vs. cross-slide accuracy gaps
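To make the last finding concrete, here is a minimal sketch of one way a within-slide vs. cross-slide accuracy gap could be measured. This summary does not describe the paper's evaluation protocol, so everything here is an assumption for illustration: the 1-nearest-neighbour classifier, the function names `nn_label` and `slide_accuracy_gap`, and the synthetic two-dimensional "embeddings".

```python
from math import dist

def nn_label(query, candidates):
    """Return the label of the candidate whose embedding is closest to query."""
    best = min(candidates, key=lambda c: dist(query, c[2]))
    return best[1]

def slide_accuracy_gap(patches):
    """patches: list of (slide_id, label, embedding) tuples.

    For every patch, predict its label twice with 1-NN (an assumed protocol):
    once from other patches on the same slide, once from patches on other slides.
    Returns (within_slide_accuracy, cross_slide_accuracy, gap).
    """
    within_hits = cross_hits = within_n = cross_n = 0
    for i, (slide, label, emb) in enumerate(patches):
        same = [p for j, p in enumerate(patches) if j != i and p[0] == slide]
        other = [p for p in patches if p[0] != slide]
        if same:
            within_n += 1
            within_hits += nn_label(emb, same) == label
        if other:
            cross_n += 1
            cross_hits += nn_label(emb, other) == label
    within_acc = within_hits / within_n
    cross_acc = cross_hits / cross_n
    return within_acc, cross_acc, within_acc - cross_acc

# Synthetic data: the first coordinate encodes the slide (an artifact),
# and the label direction on the second coordinate differs between slides.
toy = [
    ("A", "GP3", (0.0, 0.0)), ("A", "GP3", (0.1, 0.0)),
    ("A", "GP4", (0.0, 1.0)), ("A", "GP4", (0.1, 1.0)),
    ("B", "GP3", (10.0, 1.0)), ("B", "GP3", (10.1, 1.0)),
    ("B", "GP4", (10.0, 0.0)), ("B", "GP4", (10.1, 0.0)),
]
within_acc, cross_acc, gap = slide_accuracy_gap(toy)
print(within_acc, cross_acc, gap)
```

On this deliberately extreme toy set, labels are easy to recover inside a slide but not across slides, which is the failure mode the benchmark is described as targeting. Real embeddings would come from a foundation model, and the gap would be far less clear-cut.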
Abstract
Artificial intelligence foundation models are increasingly deployed for prostate cancer Gleason grading, where GP3/GP4 distinction directly impacts treatment decisions. However, these models may achieve high validation accuracy by learning specimen-specific artifacts rather than generalizable biological features, limiting real-world clinical utility. We introduce PANDA-PLUS-Bench, a curated benchmark dataset derived from expert-annotated prostate biopsies designed specifically to quantify this failure mode. The benchmark comprises nine carefully selected whole slide images from nine unique patients containing diverse Gleason patterns, with non-overlapping tissue patches extracted at both 512x512 and 224x224 pixel resolutions across eight augmentation conditions. Using this benchmark, we evaluate seven foundation models on their ability to separate biological signal from slide-level…
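The non-overlapping patch extraction mentioned in the abstract can be sketched as a grid whose stride equals the patch size. This is an illustrative assumption, not the authors' pipeline: the helper name `tile_origins` is hypothetical, tissue filtering is omitted, and the summary does not say whether the 224x224 patches are tiled directly or derived from the 512x512 ones.

```python
def tile_origins(width, height, patch):
    """Top-left (x, y) corners of non-overlapping patch x patch tiles.

    The stride equals the patch size, so no two tiles share a pixel.
    Partial tiles at the right and bottom edges are dropped.
    """
    return [
        (x, y)
        for y in range(0, height - patch + 1, patch)
        for x in range(0, width - patch + 1, patch)
    ]

# Example: tile a hypothetical 1024 x 512 pixel region at the 512-pixel size.
print(tile_origins(1024, 512, 512))
```

Keeping patches non-overlapping means no pixel is shared between two patches, so a near-duplicate of a test patch cannot appear among the patches it is compared against.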
Taxonomy
Topics: Prostate Cancer Diagnosis and Treatment · AI in cancer detection · Artificial Intelligence in Healthcare and Education
