Training LLMs for EHR-Based Reasoning Tasks via Reinforcement Learning
Jiacheng Lin, Zhenbang Wu, Jimeng Sun

TL;DR
This paper introduces EHRMIND, a method combining supervised fine-tuning and reinforcement learning with verifiable rewards to improve large language models' reasoning in electronic health record tasks, addressing knowledge gaps and misapplications.
Contribution
The paper presents a novel two-stage training approach for LLMs that enhances healthcare reasoning by injecting domain knowledge and refining decision-making through RLVR.
Findings
EHRMIND improves accuracy across clinical tasks.
The method enhances interpretability and generalization.
It effectively addresses knowledge gaps and misapplications.
Abstract
We present EHRMIND, a practical recipe for adapting large language models (LLMs) to complex clinical reasoning tasks using reinforcement learning with verifiable rewards (RLVR). While RLVR has succeeded in mathematics and coding, its application to healthcare contexts presents unique challenges due to the specialized knowledge and reasoning required for electronic health record (EHR) interpretation. Our pilot study on the MEDCALC benchmark reveals two key failure modes: (1) misapplied knowledge, where models possess relevant medical knowledge but apply it incorrectly, and (2) missing knowledge, where models lack essential domain knowledge. To address these cases, EHRMIND applies a two-stage solution: a lightweight supervised fine-tuning (SFT) warm-up that injects missing domain knowledge, stabilizes subsequent training, and encourages structured, interpretable outputs; followed by RLVR,…
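To make "verifiable rewards" concrete, here is a minimal sketch of a rule-based reward for a numeric calculation task such as those in MEDCALC. It is not the authors' reward function: the `<answer>` tag format, the function name `verifiable_reward`, and the 5% relative tolerance are all assumptions chosen for illustration.

```python
import re

# Hypothetical output format: the model is assumed to wrap its final
# answer in <answer>...</answer> tags. The paper may use another format.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def verifiable_reward(completion: str, target: float, rel_tol: float = 0.05) -> float:
    """Return 1.0 if the final tagged answer is within rel_tol of target, else 0.0.

    The reward is "verifiable" because it is computed by a deterministic
    rule against a known ground truth rather than by a learned reward model.
    """
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return 0.0  # no parseable answer: no reward
    try:
        predicted = float(matches[-1].strip())  # use the last tagged answer
    except ValueError:
        return 0.0  # answer is not a number
    allowed = rel_tol * abs(target)  # assumed tolerance band around the target
    return 1.0 if abs(predicted - target) <= allowed else 0.0
```

A binary reward like this is easy to audit, but it gives no partial credit for correct intermediate reasoning, which is one reason a model that lacks the relevant formula may never receive a positive signal from RL alone.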
Peer Reviews
Decision: Submitted to ICLR 2026
- Strong results. The proposed approach improves performance over baselines, including both open-weights and commercial LLMs, with a small 3-billion-parameter model.
- Simple methodology. Overall, the methodology is straightforward and easy to understand and implement, and it significantly improves performance.
- Detailed analysis on several benchmarks considering different scenarios (e.g., SFT, RLVR, SFT+RLVR) with practical findings.
- Limited experimentation: Llama-3-8B is the only backbone; no other LLMs are evaluated, whether the same backbone at a different scale or a different LLM at the same scale. Qwen3 models were announced in May this year and should be considered for further evaluating the proposed approach. Although the analyses are detailed, they need to be repeated with different LLMs to ensure that the findings are not specific to Llama-3-8B.
- The choice of Llama-3-8B needs to be ju
* The paper presents a clear, reproducible training recipe (light SFT to RLVR) that practitioners in healthcare could adopt quickly.
* Empirical breadth. Sensible evaluation across multiple EHR tasks with granular analyses (seen vs. unseen formulas, class-wise metrics, rationale structure).
* Limited novelty (major). No new RL algorithm or learning objective; the contribution is primarily applying known RLVR with domain-specific, verifiable rewards and providing practical caveats and insights for the EHR domain.
* Comparative fairness. Unclear compute- and sampling-matching against strong proprietary baselines; no comparison with open-source RL baselines. The message delivered seems to be that "RL fine-tuning on medical reasoning tasks beats pre-training-only closed-source big models" but that is wel
The paper's primary strength is its practical, diagnostic approach. The framing of the problem as "missing" vs. "misapplied" knowledge is insightful and clearly demonstrated. The proposal to use Pass@k as a simple, compute-cheap indicator to predict the success of pure RL is a novel and valuable contribution. The strong empirical support for this diagnostic (e.g., $R^2=0.91$ in Fig 2, and validated in Sec 4.2 & 4.4) makes this a very convincing and useful "recipe."
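For readers unfamiliar with Pass@k, the sketch below computes it with the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of sampled completions and c the number that are correct. Whether the paper uses this exact estimator is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total completions sampled for a problem
    c: number of those completions that are correct
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain at least one correct completion.
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Read as a diagnostic, a Pass@k near zero on a task suggests the base model almost never samples a correct answer, so RL with a binary reward has little to reinforce (the "missing knowledge" case), while a high Pass@k with low Pass@1 suggests the knowledge is present but inconsistently applied.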
1. The claims of reasoning over complex, noisy EHRs are undermined by a critical data filtering step detailed in Appendix F.3.2. To fit the context window, the authors discarded all patient data from the EHRSHOT benchmark except for two event types: condition_occurrence and procedure_occurrence. This means all lab results (measurement), medications (drug_exposure), and clinical notes (note) were ignored. The model is reasoning over a tiny, heavily pre-processed fraction of the EHR, not the full,
Taxonomy
Topics: Safety Systems Engineering in Autonomy
