Advances in LLM Reasoning Enable Flexibility in Clinical Problem-Solving

Kie Shidara; Preethi Prem; Jonathan Kim; Anna Podlasek; Feng Liu; Ahmed Alaa; Danilo Bernardo

arXiv:2601.11866·cs.CL·January 21, 2026

Open Access

TL;DR

Recent reasoning LLMs reason more flexibly on clinical problems: on mARC, an adversarial medical QA benchmark, stronger reasoning models avoid Einstellung-based heuristic traps more often than weaker ones and reach human-level performance.

Contribution

This study shows that recent improvements in LLM reasoning enable greater cognitive flexibility in clinical reasoning, with strong reasoning models reaching human-level performance on mARC, an adversarial medical QA benchmark.

Findings

01

Strong reasoning LLMs outperform weaker reasoning models on mARC, an adversarial medical QA benchmark.

02

On the questions physicians most commonly missed, the top 5 models answered 55% to 70% correctly with high confidence.

03

Strong reasoning models avoid Einstellung-based heuristic traps more often than weaker reasoning models, reaching human-level performance.

Abstract

Large Language Models (LLMs) have achieved high accuracy on medical question-answer (QA) benchmarks, yet their capacity for flexible clinical reasoning has been debated. Here, we asked whether advances in reasoning LLMs improve their cognitive flexibility in clinical reasoning. We assessed reasoning models from the OpenAI, Grok, Gemini, Claude, and DeepSeek families on the medicine abstraction and reasoning corpus (mARC), an adversarial medical QA benchmark which utilizes the Einstellung effect to induce inflexible overreliance on learned heuristic patterns in contexts where they become suboptimal. We found that strong reasoning models avoided Einstellung-based traps more often than weaker reasoning models, achieving human-level performance on mARC. On questions most commonly missed by physicians, the top 5 performing models answered 55% to 70% correctly with high confidence, indicating…

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills