MedResearcher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework

Ailing Yu; Lan Yao; Jingnan Liu; Zhe Chen; Jiajun Yin; Yuan Wang; Xinhao Liao; Zhiling Ye; Ji Li; Yun Yue; Hansong Xiao; Hualei Zhou; Chunxiao Guo; Peng Wei; Junwei Liu; Jinjie Gu

arXiv:2508.14880 · cs.CL · September 3, 2025

TL;DR

MedResearcher-R1 is a specialized medical deep research agent that leverages knowledge graphs, custom retrieval tools, and a two-stage training process to outperform larger proprietary systems on medical benchmarks.

Contribution

The paper introduces a novel data synthesis framework using medical knowledge graphs and a custom retrieval engine, combined with a two-stage training paradigm, to enhance medical reasoning in deep research agents.

Findings

1. Achieved state-of-the-art results on medical benchmarks.
2. Generated over 2100 diverse research trajectories in 12 specialties.
3. Outperformed larger proprietary systems in medical domain tasks.

Abstract

Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges, as evidenced by leading proprietary systems achieving limited accuracy on complex medical benchmarks. The key limitations are: (1) the model lacks sufficient dense medical knowledge for clinical reasoning, and (2) the framework is constrained by the absence of specialized retrieval tools tailored for medical contexts. We present a medical deep research agent that addresses these challenges through two core innovations. First, we develop a novel data synthesis framework using medical knowledge graphs,…
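To make the idea of knowledge-graph-based question synthesis more concrete, here is a minimal sketch of sampling a multi-hop path from a graph and turning it into a question. It is an illustration only, not the authors' pipeline: the toy triples, the placeholder entity names, and the helpers `build_adjacency`, `sample_path`, and `path_to_question` are all hypothetical.

```python
import random

# Hypothetical toy graph of (head, relation, tail) triples.
# Entity names are placeholders, not real medical facts.
TRIPLES = [
    ("drug A", "treats", "disease B"),
    ("disease B", "is associated with", "gene C"),
    ("gene C", "is located on", "chromosome D"),
    ("drug A", "inhibits", "enzyme E"),
]


def build_adjacency(triples):
    """Map each head entity to its outgoing (relation, tail) edges."""
    adjacency = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))
    return adjacency


def sample_path(adjacency, start, max_hops, rng):
    """Random walk of at most max_hops edges that never revisits a node."""
    path = []
    node = start
    visited = {start}
    for _ in range(max_hops):
        choices = [(r, t) for r, t in adjacency.get(node, []) if t not in visited]
        if not choices:
            break  # dead end: return the shorter path
        relation, nxt = rng.choice(choices)
        path.append((node, relation, nxt))
        visited.add(nxt)
        node = nxt
    return path


def path_to_question(path):
    """Hide the intermediate entities; the final tail is the answer."""
    if not path:
        raise ValueError("path must contain at least one edge")
    relations = " -> ".join(relation for _, relation, _ in path)
    question = (
        f"Starting from '{path[0][0]}', which entity is reached via: {relations}?"
    )
    return question, path[-1][2]


if __name__ == "__main__":
    adjacency = build_adjacency(TRIPLES)
    path = sample_path(adjacency, "drug A", max_hops=3, rng=random.Random(0))
    question, answer = path_to_question(path)
    print(question)
    print("Answer:", answer)
```

Longer paths force an answering agent to resolve each intermediate entity before it can reach the final one, which is the sense in which such questions are multi-hop. A real system would need a far larger graph and a step that rewrites the path into natural language.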

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 5

Strengths

1. The topic is very interesting, and the way the knowledge graph is created is also interesting.
2. I believe that think-verify is a good way to get better answers, and that SFT is also needed in medical reasoning.
3. The model outperforms all baselines with only a small training dataset, which is really exciting.

Weaknesses

1. All figures are confusing.
2. It is a little bit hard to follow.
3. Some parts need further clarification.
4. Please clarify on training dataset and the objects in each training stage.

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- The authors identify specific limitations in existing general search engines (lack of dense medical knowledge and inadequate specialized retrieval) and clearly motivate the problem, which makes the research focus important.
- The core contribution lies in the data generation pipeline, which integrates KISA and MTG: KISA constructs structured knowledge graphs to generate complex multi-hop medical questions with adaptive difficulty calibration, and MTG tries to improve reasoni

Weaknesses

- The overall framework follows the well-explored ReAct + GRPO fine-tuning approach for agents; the contribution lies mainly in dataset construction rather than algorithmic innovation.
- The main technical limitation of the paper is the lack of detailed experimental and ablation analyses. Although the authors train with SFT followed by RLVR, the training dynamics are overlooked; in particular, the individual contributions of SFT and the impact of specific reward design choices are not

Reviewer 03 · Rating 6 · Confidence 3

Strengths

S1: Medical deep research is an important question to study and requires specialized training and tool efforts.
S2: The paper made several contributions, including KISA, the trajectory synthesis approach, and MedResearcher-R1, through large-scale agent training.
S3: The evaluation results look strong.

Weaknesses

W1: Although the authors promised to release code and models in the reproducibility statement, there is no anonymous link or code uploaded in the supplementary material. It would be very valuable to see them open-sourced.
W2: The related work does not sufficiently cover AI agents in the medical domain, such as MedAgentGym and MedAgentBench. Please also consider testing MedResearcher-R1 on those benchmarks.

Reviewer 04 · Rating 4 · Confidence 4

Strengths

1. The KISA framework for trajectory and dataset synthesis brings ambitious structure and challenge to the often-sparse training of medical research agents. Leveraging rare medical entities and graph-based reasoning, the framework produces questions that simulate authentic, expert-level medical inquiries rather than trivial lookups.
2. The introduction of a private medical retriever elevates retrieval fidelity, while the dynamic tool selection policy usefully blends domain-specific and general r

Weaknesses

1. The lack of direct baseline comparisons with professional biomedical large models weakens the analysis. In the manuscript, classifying "Gemini-2.5-Pro-deepsearch" and "o3-deepresearch" as Medical Domain Baselines is inaccurate; genuine Medical Domain Baselines should be incorporated.
2. The main body of the paper presents a deep research model in the medical domain; however, its evaluation in the medical field is limited to the MedBrowseComp benchmark, raising concerns about its generalizabil

Code & Models

Models

🤗
AQ-MedAI/MedResearcher-R1-32B
model · 7 dl · ♡ 32

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education