Disentangling Reasoning and Knowledge in Medical Large Language Models

Rahul Thapa; Qingyang Wu; Kevin Wu; Harrison Zhang; Angela Zhang; Eric Wu; Haotian Ye; Suhana Bedi; Nevin Aresh; Joseph Boen; Shriya Reddy; Ben Athiwaratkun; Shuaiwen Leon Song; James Zou

arXiv:2505.11462·cs.CL·June 25, 2025

TL;DR

This paper separates reasoning and knowledge in medical LLM benchmarks, analyzes model performance on each, and introduces BioMed-R1, a model trained to improve reasoning in medical AI.

Contribution

It introduces a method to distinguish reasoning from knowledge in biomedical QA benchmarks and develops BioMed-R1, a model optimized for reasoning accuracy.

Findings

01. Only 32.8% of questions require complex reasoning

02. Biomedical models show larger gaps between knowledge and reasoning performance

03. BioMed-R1 outperforms similar-sized models on reasoning tasks

Abstract

Medical reasoning in large language models (LLMs) aims to emulate clinicians' diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We address this by separating 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human performance. Our analysis shows that only 32.8 percent of questions require complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent gaps between knowledge and reasoning performance. For example, HuatuoGPT-o1 scores 56.9 on knowledge but only 44.8 on reasoning. In adversarial tests where models are misled with incorrect initial reasoning, biomedical models degrade sharply, while larger or RL-trained…
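
The split-then-score analysis described in the abstract can be illustrated with a small sketch. It assumes each benchmark question already carries a "reasoning" or "knowledge" label (in the paper this comes from a PubMedBERT classifier, which is not reproduced here) and a flag for whether the evaluated model answered correctly. The names `Item`, `subset_accuracy`, `reasoning_share`, and `knowledge_reasoning_gap` are hypothetical, and the data is a toy example, not results from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Item:
    """One benchmark question (hypothetical structure, for illustration only)."""
    label: str     # "reasoning" or "knowledge", assumed to come from a classifier
    correct: bool  # whether the evaluated model answered this question correctly


def subset_accuracy(items: List[Item], label: str) -> Optional[float]:
    """Accuracy in percent on the questions carrying the given label."""
    subset = [it for it in items if it.label == label]
    if not subset:
        return None  # no questions with this label
    return 100.0 * sum(it.correct for it in subset) / len(subset)


def reasoning_share(items: List[Item]) -> float:
    """Percentage of questions labeled as reasoning-focused."""
    return 100.0 * sum(it.label == "reasoning" for it in items) / len(items)


def knowledge_reasoning_gap(items: List[Item]) -> Optional[float]:
    """Knowledge accuracy minus reasoning accuracy, in percentage points."""
    knowledge = subset_accuracy(items, "knowledge")
    reasoning = subset_accuracy(items, "reasoning")
    if knowledge is None or reasoning is None:
        return None
    return knowledge - reasoning


# Toy data: 4 knowledge questions (3 correct), 2 reasoning questions (1 correct).
toy_items = [
    Item("knowledge", True),
    Item("knowledge", True),
    Item("knowledge", True),
    Item("knowledge", False),
    Item("reasoning", True),
    Item("reasoning", False),
]

knowledge_acc = subset_accuracy(toy_items, "knowledge")
reasoning_acc = subset_accuracy(toy_items, "reasoning")
gap = knowledge_reasoning_gap(toy_items)
```

Applied to the figures quoted in the abstract, the same subtraction gives HuatuoGPT-o1 a gap of 12.1 points (56.9 on knowledge minus 44.8 on reasoning).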

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling