Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning

Jonathan Kim; Anna Podlasek; Kie Shidara; Feng Liu; Ahmed Alaa; Danilo Bernardo

arXiv:2502.04381 · cs.CL · November 13, 2025

TL;DR

This paper uses a specialized benchmark, M-ARC, to investigate the limitations of large language models in clinical problem-solving, finding that they reason inflexibly, are overconfident, and perform poorly compared to physicians.

Contribution

The study introduces M-ARC, a novel benchmark that exposes LLMs' reasoning failures in clinical scenarios, highlighting their inflexibility and overconfidence in medical tasks.

Findings

01 LLMs perform poorly compared to physicians on M-ARC.

02 LLMs often lack commonsense medical reasoning and hallucinate.

03 LLMs exhibit overconfidence despite limited accuracy.
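
One simple way to make "overconfidence despite limited accuracy" concrete is to compare a model's average stated confidence with its observed accuracy. The sketch below is an illustrative assumption, not the paper's evaluation method: the function name `overconfidence_gap` and the toy values are hypothetical and are not results from M-ARC.

```python
from typing import Sequence


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Return mean stated confidence minus observed accuracy.

    A positive value means the answers were, on average, stated with
    more confidence than their accuracy supports.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)  # True counts as 1, False as 0
    return mean_confidence - accuracy


# Toy, made-up values for illustration only (hypothetical, not from the paper).
toy_confidences = [0.9, 0.8, 0.95, 0.85]
toy_correct = [True, False, False, True]
gap = overconfidence_gap(toy_confidences, toy_correct)
print(f"overconfidence gap: {gap:.3f}")
```

A gap near zero would suggest stated confidence roughly tracks accuracy; a large positive gap is the pattern the third finding describes.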

Abstract

Large Language Models (LLMs) have attained human-level accuracy on medical question-answer (QA) benchmarks. However, their limitations in navigating open-ended clinical scenarios have recently been shown, raising concerns about the robustness and generalizability of LLM reasoning across diverse, real-world medical tasks. To probe potential LLM failure modes in clinical problem-solving, we present the medical abstraction and reasoning corpus (M-ARC). M-ARC assesses clinical reasoning through scenarios designed to exploit the Einstellung effect (the fixation of thought arising from prior experience), targeting LLM inductive biases toward inflexible pattern matching from their training data rather than flexible reasoning. We find that LLMs, including current state-of-the-art o1 and Gemini models, perform poorly compared to physicians on M-ARC, often demonstrating lack of…

Code & Models

Datasets

mkieffer/M-ARC
dataset · 56 dl