MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline

Jiyao Liu; Jianghan Shen; Sida Song; Tianbin Li; Xiaojia Liu; Rongbin Li; Ziyan Huang; Jiashi Lin; Junzhi Ning; Changkai Ji; Siqi Luo; Wenjie Li; Chenglong Ma; Ming Hu; Jing Xiong; Jin Ye; Bin Fu; Ningsheng Xu; Yirong Chen; Lei Jin; Hong Chen; and Junjun He

arXiv:2604.18418 · cs.CV · April 21, 2026

TL;DR

MedProbeBench is a new benchmark designed to evaluate large language models' ability to perform multi-step evidence integration and expert-level reasoning in medical guideline development.

Contribution

The paper introduces MedProbeBench, which the authors describe as the first benchmark to use high-quality clinical guidelines as expert-level references for evaluating deep evidence integration in medical AI systems.

Findings

1. Current models show significant gaps in evidence integration and guideline generation.
2. Evaluation reveals the need for improved reasoning and verification capabilities in medical AI.
3. MedProbeBench provides a comprehensive framework for assessing expert-level medical reasoning.

Abstract

Recent advances in deep research systems enable large language models to retrieve, synthesize, and reason over large-scale external knowledge. In medicine, developing clinical guidelines critically depends on such deep evidence integration. However, existing benchmarks fail to evaluate this capability in realistic workflows requiring multi-step evidence integration and expert-level judgment. To address this gap, we introduce MedProbeBench, the first benchmark leveraging high-quality clinical guidelines as expert-level references. Medical guidelines, with their rigorous standards in neutrality and verifiability, represent the pinnacle of medical expertise and pose substantial challenges for deep research agents. For evaluation, we propose MedProbe-Eval, a comprehensive evaluation framework featuring: (1) Holistic Rubrics with 1,200+ task-adaptive rubric criteria for comprehensive quality…

Code & Models

Repositories

uni-medical/MedProbeBench (GitHub)
