Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning

David Bani-Harouni; Chantal Pellegrini; Ege Özsoy; Nassir Navab; Matthias Keicher

arXiv:2506.13474·cs.CL·March 3, 2026

Open Access 3 Reviews

TL;DR

This paper introduces LA-CDM, a language agent trained with reinforcement learning to model iterative, hypothesis-driven clinical decision-making, improving diagnostic accuracy and efficiency on real-world data.

Contribution

It presents a novel hybrid training approach for language agents to support dynamic clinical diagnosis through iterative hypothesis testing.

Findings

1. LA-CDM improves diagnostic accuracy on the MIMIC-CDM dataset.
2. Explicit training enhances clinical decision-making efficiency.
3. The approach models realistic, interactive diagnostic processes.

Abstract

Clinical decision-making is a dynamic, interactive, and cyclic process in which doctors repeatedly decide which clinical action to perform and consider newly uncovered information for diagnosis and treatment. Large Language Models (LLMs) have the potential to support clinicians in this process; however, most applications of LLMs in clinical decision support suffer from one of two limitations: either they assume the unrealistic scenario of immediate availability of all patient information and do not model the interactive and iterative investigation process, or they restrict themselves to the limited "out-of-the-box" capabilities of large pre-trained models without performing task-specific training. In contrast, we propose to model clinical decision-making for diagnosis with a hypothesis-driven uncertainty-aware language agent, LA-CDM, that converges towards a diagnosis…
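The iterative process described in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Episode`, `hypothesis_agent`, `decision_agent`, and `run_episode`, the confidence rule, the 0.9 threshold, and the choice not to charge for unavailable tests are all hypothetical. The two stub agents stand in for LLM calls, and the two test costs are the figures quoted in the reviews, used here only as example values.

```python
from dataclasses import dataclass, field

# Toy cost table in USD (figures quoted in the reviews; illustrative only).
TEST_COSTS = {"CBC": 71, "MRI": 4866}


@dataclass
class Episode:
    """Hypothetical patient record: maps each recorded test to its result text."""
    available: dict
    observations: list = field(default_factory=list)
    total_cost: int = 0


def hypothesis_agent(observations):
    """Stub for an LLM call: returns (hypothesis, confidence).

    Assumption: confidence simply grows with the number of observations.
    """
    confidence = min(0.3 + 0.35 * len(observations), 1.0)
    return "appendicitis", confidence


def decision_agent(hypothesis, confidence, untried, threshold=0.9):
    """Stub for an LLM call: diagnose when confident or out of tests,
    otherwise request the cheapest test not yet tried."""
    if confidence >= threshold or not untried:
        return ("diagnose", hypothesis)
    return ("test", min(untried, key=TEST_COSTS.get))


def run_episode(episode, max_steps=5):
    """Alternate hypothesis formation and action selection until a diagnosis."""
    tried = set()
    for _ in range(max_steps):
        hypothesis, confidence = hypothesis_agent(episode.observations)
        untried = [t for t in TEST_COSTS if t not in tried]
        action, arg = decision_agent(hypothesis, confidence, untried)
        if action == "diagnose":
            return arg, episode.total_cost
        tried.add(arg)
        result = episode.available.get(arg)
        if result is not None:  # assumption: unavailable tests cost nothing
            episode.observations.append(result)
            episode.total_cost += TEST_COSTS[arg]
    # Step budget exhausted: fall back to the current hypothesis.
    return hypothesis_agent(episode.observations)[0], episode.total_cost


if __name__ == "__main__":
    patient = Episode(available={"CBC": "elevated WBC", "MRI": "inflamed appendix"})
    print(run_episode(patient))
```

In a trained agent, both stubs would be replaced by model calls, and the accumulated cost and final diagnosis would feed a reward signal rather than being returned directly.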

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper addresses an important gap in clinical AI systems by modeling the iterative nature of diagnostic reasoning.
2. The paper provides thorough comparison against multiple baselines and includes cost-efficiency metrics.
3. The paper attempts to integrate uncertainty estimation with sequential decision-making in a clinical context.
4. The paper includes detailed prompts and implementation details that facilitate understanding of the approach.

Weaknesses

1. The paper introduces separate hypothesis and decision agents without sufficiently justifying their architectural separation. Specifically:
   - The hypothesis agent generates diagnostic hypotheses and confidence scores, while the decision agent uses this output to select tests or final diagnoses. However, both agents inherently engage in diagnostic reasoning, and the decision agent could potentially internalize hypothesis generation and confidence estimation.
   - The confidence score produced by t

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. **Innovative design:** Two-agent division closely reflects clinical reasoning loops.
2. **Calibration reward:** Improves reliability of verbal confidence estimates.
3. **Cost-sensitive optimization:** Demonstrates efficiency improvements without sacrificing accuracy.
4. **Transparent methodology:** Prompts, cost tables, and training configs are public.
5. **Multimodal context:** Integrates notes, labs, and imaging text within a unified environment.
6. **Clear ablations:** Each modul

Weaknesses

1. **Limited dataset scope:** Only four conditions; no cross-domain or cross-hospital validation.
2. **Environment artifacts:** Some requested tests are unavailable, possibly biasing learning.
3. **Weak baselines:** Comparison methods are not fully aligned in modality or setting.
4. **Preprocessing bias:** Summarization of clinical notes may distort reasoning cues.
5. **Safety unaddressed:** No human-in-the-loop validation or fail-safe mechanism.
6. **Ablation granularity:** Lacks expl

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper addresses a real-world clinical problem. The design of the two agents reflects the actual diagnostic process: hypothesis formation → targeted testing → decision-making.
2. The paper demonstrates a strong focus on cost. The R_cost term explicitly penalizes expensive tests (e.g., MRI at USD 4,866 vs. CBC at USD 71), driving parsimonious testing.
3. The paper handles uncertainty in a principled way. Confidence calibration via RL (betting-style reward) trains the model to express calibrat

Weaknesses

1. The evaluation is limited to four abdominal diseases from a single retrospective dataset; without broader disease coverage or external validation, generalizability is unclear.
2. The data is limited by its retrospective nature and by test-availability bias. Many requested tests are unavailable in the logs, forcing the agent to try alternatives; this caps exploration to clinician-observed pathways and can bias learned policies.
3. The paper preprocesses patient information

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Machine Learning in Healthcare