PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology

Yimin Zhao; Sheela R. Damle; Simone E. Dekker; Scott Geng; Karly Williams Silva; Jesse J Hubbard; Manuel F Fernandez; Fatima Zelada-Arenas; Alejandra Alvarez; Brianne Flores; Alexis Rodriguez; Stephen Salerno; Carrie Wright; Zihao Wang; Pang Wei Koh; Jeffrey T. Leek

arXiv:2603.01343 · cs.CL · March 3, 2026

Open Access

TL;DR

This paper introduces PanCanBench, a detailed benchmark for evaluating large language models in pancreatic cancer care, emphasizing factual accuracy, clinical completeness, and the impact of web-search integration.

Contribution

It presents an expert-annotated benchmark of 3,130 question-specific criteria across 282 authentic patient questions, built with a human-in-the-loop rubric pipeline, and evaluates 22 proprietary and open-source LLMs against it.

Findings

01

Models vary significantly in clinical completeness and factual accuracy.

02

Hallucination rates can be as high as 53.8% among evaluated models.

03

Web search integration does not necessarily improve response quality.

Abstract

Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. As patients and clinicians increasingly use LLMs for guidance on complex conditions such as pancreatic cancer, evaluation must extend beyond general medical knowledge. Existing frameworks, such as HealthBench, rely on simulated queries and lack disease-specific depth. Moreover, high rubric-based scores do not ensure factual correctness, underscoring the need to assess hallucinations. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions from the Pancreatic Cancer Action Network (PanCAN). The resulting benchmark, PanCanBench, includes 3,130 question-specific criteria across 282 authentic patient questions. We evaluated 22 proprietary and open-source LLMs using…
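The abstract is cut off before it describes how responses are scored, so the following is only a minimal sketch of one common way to turn question-specific rubric criteria into a score. It assumes an unweighted scheme in which each criterion is judged met or not met and a response's score is the fraction of criteria met. The names `Criterion`, `rubric_score`, and `benchmark_score` are hypothetical, and the paper's actual scoring rule may differ (for example, it may weight criteria).

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One question-specific rubric item (hypothetical structure)."""
    text: str
    met: bool  # grader's judgment for a given model response (assumed binary)


def rubric_score(criteria: list[Criterion]) -> float:
    """Fraction of rubric criteria met by one response (assumed unweighted)."""
    if not criteria:
        raise ValueError("a question needs at least one criterion")
    return sum(c.met for c in criteria) / len(criteria)


def benchmark_score(per_question: dict[str, list[Criterion]]) -> float:
    """Mean of per-question rubric scores, so each question counts equally."""
    scores = [rubric_score(criteria) for criteria in per_question.values()]
    return sum(scores) / len(scores)


# Illustrative data only; these are not items from PanCanBench.
graded = {
    "q1": [
        Criterion("mentions staging workup", True),
        Criterion("recommends discussing options with the care team", True),
        Criterion("notes clinical trial availability", False),
        Criterion("avoids unsupported survival statistics", False),
    ],
    "q2": [
        Criterion("explains the purpose of the test", True),
        Criterion("describes next steps", True),
    ],
}

overall = benchmark_score(graded)

# The figures reported above imply roughly 11 criteria per question.
criteria_per_question = 3130 / 282
```

Averaging per question rather than pooling all criteria keeps questions with long rubrics from dominating the total. That is a design choice of this sketch, not something the source states. As the abstract notes, a high rubric score does not by itself guarantee factual correctness, so a score like this would sit alongside a separate hallucination check.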

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Pancreatic and Hepatic Oncology Research · Topic Modeling