PANDA-PLUS-Bench: A Clinical Benchmark for Evaluating Robustness of AI Foundation Models in Prostate Cancer Diagnosis

Joshua L. Ebbert; Dennis Della Corte

arXiv:2512.14922 · cs.CV · December 18, 2025

Open Access

TL;DR

PANDA-PLUS-Bench is a new curated dataset and benchmark designed to evaluate the robustness of AI models in prostate cancer diagnosis, specifically assessing their ability to distinguish biological signals from slide-specific artifacts.

Contribution

The paper introduces PANDA-PLUS-Bench, a specialized benchmark dataset for evaluating foundation models' robustness in prostate cancer Gleason grading, highlighting variability in model performance and emphasizing the importance of tissue-specific training.

Findings

01. HistoEncoder achieved the highest cross-slide accuracy (59.7%)

02. Large models showed significant slide-level encoding variability

03. All models exhibited within-slide vs. cross-slide accuracy gaps
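
To make the within-slide vs. cross-slide gap concrete, here is a minimal sketch of one way such a gap could be measured. It is not the paper's evaluation protocol, which this summary does not specify. It assumes a 1-nearest-neighbor probe over patch embeddings, and the function name `nearest_neighbor_accuracy` and the toy data are hypothetical.

```python
import numpy as np

def nearest_neighbor_accuracy(embeddings, labels, slide_ids, same_slide):
    """1-NN accuracy where each patch's neighbor is drawn only from its own
    slide (same_slide=True) or only from other slides (same_slide=False).
    Assumed probe for illustration, not the benchmark's actual metric."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    slide_ids = np.asarray(slide_ids)
    # Pairwise squared Euclidean distances between all patch embeddings.
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    same = slide_ids[:, None] == slide_ids[None, :]
    allowed = same if same_slide else ~same
    np.fill_diagonal(allowed, False)  # a patch may not be its own neighbor
    dists = np.where(allowed, dists, np.inf)
    has_candidate = allowed.any(axis=1)  # skip patches with no valid neighbor
    nearest = np.argmin(dists, axis=1)
    correct = labels[nearest] == labels
    return float(np.mean(correct[has_candidate]))

# Toy 1-D embeddings in which the class direction is flipped between slides,
# mimicking a representation that encodes slide-specific artifacts.
embeddings = [[0.0], [0.1], [1.0], [1.1], [1.0], [1.1], [0.0], [0.1]]
labels = [0, 0, 1, 1, 0, 0, 1, 1]
slide_ids = ["A", "A", "A", "A", "B", "B", "B", "B"]

within = nearest_neighbor_accuracy(embeddings, labels, slide_ids, same_slide=True)
cross = nearest_neighbor_accuracy(embeddings, labels, slide_ids, same_slide=False)
gap = within - cross
```

In this contrived example the within-slide accuracy is 1.0 and the cross-slide accuracy is 0.0, the extreme case of a model whose features separate classes only relative to a given slide.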

Abstract

Artificial intelligence foundation models are increasingly deployed for prostate cancer Gleason grading, where GP3/GP4 distinction directly impacts treatment decisions. However, these models may achieve high validation accuracy by learning specimen-specific artifacts rather than generalizable biological features, limiting real-world clinical utility. We introduce PANDA-PLUS-Bench, a curated benchmark dataset derived from expert-annotated prostate biopsies designed specifically to quantify this failure mode. The benchmark comprises nine carefully selected whole slide images from nine unique patients containing diverse Gleason patterns, with non-overlapping tissue patches extracted at both 512x512 and 224x224 pixel resolutions across eight augmentation conditions. Using this benchmark, we evaluate seven foundation models on their ability to separate biological signal from slide-level…
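
The non-overlapping patch extraction mentioned in the abstract can be sketched as a simple grid tiling. This is an illustrative assumption only. The abstract does not say how edges, tissue filtering, or the relationship between the 512x512 and 224x224 patches are handled, and the function name `extract_non_overlapping_patches` and the blank test region are hypothetical.

```python
import numpy as np

def extract_non_overlapping_patches(image, patch_size):
    """Tile an H x W x C array into non-overlapping square patches.
    Partial patches at the right and bottom edges are dropped (an assumption)."""
    height, width = image.shape[:2]
    patches = []
    # Stepping by patch_size guarantees that no two patches share a pixel.
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return patches

# Hypothetical stand-in for a region of a whole slide image.
region = np.zeros((1024, 1536, 3), dtype=np.uint8)
patches_512 = extract_non_overlapping_patches(region, 512)
patches_224 = extract_non_overlapping_patches(region, 224)
```

For this 1024 x 1536 region the grid yields 6 patches at 512x512 and 24 patches at 224x224. A real pipeline would typically read regions through a whole-slide-image library and discard background tiles, steps this sketch leaves out.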

Taxonomy

Topics: Prostate Cancer Diagnosis and Treatment · AI in cancer detection · Artificial Intelligence in Healthcare and Education