Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes

Matthew W. Kenaston (1), Umair Ayub (1), Mihir Parmar (2), Muhammad Umair Anjum (1), Syed Arsalan Ahmed Naqvi (1), Priya Kumar (1), Samarth Rawal (1), Aadel A. Chaudhuri (4), Yousef Zakharia (3), Elizabeth I. Heath (5), Tanios S. Bekaii-Saab (3), Cui Tao (6), Eliezer M. Van Allen (7), Ben Zhou (2), YooJung Choi (2), Chitta Baral (2), and Irbaz Bin Riaz (1, 3, 6)

(1) Mayo Clinic College of Medicine and Science, Phoenix, AZ; (2) School of Computing and AI, Arizona State University, Tempe, AZ; (3) Mayo Clinic Comprehensive Cancer Center, Phoenix, AZ; (4) Department of Radiation Oncology, Mayo Clinic, Rochester, MN; (5) Department of Oncology, Mayo Clinic, Rochester, MN; (6) Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN; (7) Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA

arXiv:2511.20680 · cs.CL · November 27, 2025

Open Access

TL;DR

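Large language models can reach correct conclusions about clinical oncology notes through faulty reasoning. In this study, reasoning errors occurred in 23% of model interpretations and were associated with guideline-discordant and harmful recommendations, a failure mode that accuracy-based evaluation does not capture and that calls for better evaluation frameworks before clinical use.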

Contribution

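The paper introduces a three-tier hierarchical taxonomy of LLM reasoning errors in oncology that maps computational failures to cognitive bias frameworks. The taxonomy was developed from 600 annotated reasoning traces on breast and pancreatic cancer notes and validated on 822 responses to prostate cancer consult notes.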

Findings

01

Reasoning errors occurred in 23% of interpretations.

02

Confirmation and anchoring biases were most common.

03

Errors were associated with guideline-discordant and harmful recommendations.

Abstract

Despite high performance on clinical benchmarks, large language models may reach correct conclusions through faulty reasoning, a failure mode with safety implications for oncology decision support that is not captured by accuracy-based evaluation. In this two-cohort retrospective study, we developed a hierarchical taxonomy of reasoning errors from GPT-4 chain-of-thought responses to real oncology notes and tested its clinical relevance. Using breast and pancreatic cancer notes from the CORAL dataset, we annotated 600 reasoning traces to define a three-tier taxonomy mapping computational failures to cognitive bias frameworks. We validated the taxonomy on 822 responses from prostate cancer consult notes spanning localized through metastatic disease, simulating extraction, analysis, and clinical recommendation tasks. Reasoning errors occurred in 23 percent of interpretations and dominated…

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills