MedResearcher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework
Ailing Yu, Lan Yao, Jingnan Liu, Zhe Chen, Jiajun Yin, Yuan Wang, Xinhao Liao, Zhiling Ye, Ji Li, Yun Yue, Hansong Xiao, Hualei Zhou, Chunxiao Guo, Peng Wei, Junwei Liu, Jinjie Gu

TL;DR
MedResearcher-R1 is a specialized medical deep research agent that leverages knowledge graphs, custom retrieval tools, and a two-stage training process to outperform larger proprietary systems on medical benchmarks.
Contribution
The paper introduces a novel data synthesis framework using medical knowledge graphs and a custom retrieval engine, combined with a two-stage training paradigm, to enhance medical reasoning in deep research agents.
Findings
Achieved state-of-the-art results on medical benchmarks.
Generated over 2100 diverse research trajectories in 12 specialties.
Outperformed larger proprietary systems in medical domain tasks.
Abstract
Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges, as evidenced by leading proprietary systems achieving limited accuracy on complex medical benchmarks. The key limitations are: (1) the model lacks sufficient dense medical knowledge for clinical reasoning, and (2) the framework is constrained by the absence of specialized retrieval tools tailored for medical contexts. We present a medical deep research agent that addresses these challenges through two core innovations. First, we develop a novel data synthesis framework using medical knowledge graphs,…
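To make the idea of synthesizing questions from a medical knowledge graph more concrete, here is a minimal sketch of one way a multi-hop question could be derived from a graph path. It is not the authors' pipeline: the toy triples, the function names (`build_adjacency`, `sample_path`, `path_to_question`), and the question template are all hypothetical and chosen only for illustration.

```python
import random

# Toy knowledge graph as (head, relation, tail) triples.
# ASSUMPTION: entities and relations are placeholders, not data from the paper.
TRIPLES = [
    ("drug_A", "treats", "disease_B"),
    ("disease_B", "caused_by", "gene_C"),
    ("gene_C", "located_on", "chromosome_D"),
    ("drug_A", "inhibits", "enzyme_E"),
]


def build_adjacency(triples):
    """Map each head entity to its outgoing (relation, tail) edges."""
    adj = {}
    for head, rel, tail in triples:
        adj.setdefault(head, []).append((rel, tail))
    return adj


def sample_path(adj, start, hops, rng):
    """Random walk of up to `hops` edges that never revisits an entity."""
    path, current, seen = [], start, {start}
    for _ in range(hops):
        options = [(r, t) for r, t in adj.get(current, []) if t not in seen]
        if not options:
            break  # dead end: return the shorter path
        rel, nxt = rng.choice(options)
        path.append((current, rel, nxt))
        seen.add(nxt)
        current = nxt
    return path


def path_to_question(path):
    """Hide intermediate entities; the answer is the last entity on the path."""
    start = path[0][0]
    relations = ", then ".join(f"'{rel}'" for _, rel, _ in path)
    question = (
        f"Starting from {start}, follow {relations}. "
        "Which entity do you reach?"
    )
    return question, path[-1][2]


if __name__ == "__main__":
    adj = build_adjacency(TRIPLES)
    path = sample_path(adj, "drug_A", hops=3, rng=random.Random(0))
    question, answer = path_to_question(path)
    print(question)
    print("answer:", answer)
```

Because the intermediate entities are omitted from the question text, answering it requires resolving each hop in turn, which is the property that makes such questions harder than single-fact lookups.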
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The topic is very interesting, and so is the approach to constructing the knowledge graph. 2. I believe that think-verify is a good way to get better answers and that SFT is needed for medical reasoning. 3. The model outperforms all baselines with only a small training dataset, which is really exciting.
1. All figures are confusing. 2. The paper is a little hard to follow. 3. Some parts need further clarification. 4. Please clarify the training dataset and the objects in each training stage.
- The authors identify specific limitations in existing general search engines (lack of dense medical knowledge and inadequate specialized retrieval) and clearly motivate the problem, which makes the research focus important. - The core contribution lies in the data generation pipeline, which integrates KISA and MTG: KISA constructs structured knowledge graphs to generate complex multi-hop medical questions with adaptive difficulty calibration, and MTG tries to improve reasoni
- The overall framework follows the well-explored ReAct + GRPO fine-tuning recipe for agents, so the contribution lies mainly in dataset construction rather than algorithmic innovation. - The main technical limitation of the paper is the lack of detailed experimental and ablation analyses. Although the authors train with SFT followed by RLVR, the training dynamics are overlooked; in particular, the individual contributions of SFT and the impact of specific reward design choices are not
S1: Medical deep research is an important question to study and requires specialized training and tool efforts. S2: The paper made several contributions, including KISA, the trajectory synthesis approach, and MedResearcher-R1, through large-scale agent training. S3: The evaluation results look strong.
W1: Although the authors promised to release code and models in the reproducibility statement, there is no anonymous link or code uploaded in the supplementary material. It would be very valuable to see them open-sourced. W2: The related work lacks sufficient coverage of AI agents in the medical domain, such as MedAgentGym and MedAgentBench. Please also consider testing MedResearcher-R1 on those benchmarks.
1. The KISA framework for trajectory and dataset synthesis brings ambitious structure and challenge to the often-sparse training of medical research agents. Leveraging rare medical entities and graph-based reasoning, the framework produces questions that simulate authentic, expert-level medical inquiries rather than trivial lookups. 2. The introduction of a private medical retriever elevates retrieval fidelity, while the dynamic tool selection policy usefully blends domain-specific and general r
1. The lack of direct baseline comparisons with professional biomedical large models weakens the analysis. In the manuscript, classifying "Gemini-2.5-Pro-deepsearch" and "o3-deepresearch" as Medical Domain Baselines is inaccurate; genuine Medical Domain Baselines should be incorporated. 2. The main body of the paper presents a deep research model in the medical domain; however, its evaluation in the medical field is limited to the MedBrowseComp benchmark, raising concerns about its generalizabil
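One review above describes the training recipe as ReAct + GRPO fine-tuning. As background, the following minimal sketch shows the group-relative advantage computation commonly associated with GRPO: rewards for several sampled responses to the same prompt are normalized by the group's mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` value are assumptions for illustration; the paper's actual reward design is not described in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    ASSUMPTION: population standard deviation plus a small `eps` to avoid
    division by zero; implementations differ on these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one question, scored 1.0 if verified correct.
    rewards = [1.0, 0.0, 0.0, 1.0]
    for reward, adv in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.1f} advantage={adv:+.3f}")
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason reward design and question difficulty matter for this style of training.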
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education
