MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline
Jiyao Liu, Jianghan Shen, Sida Song, Tianbin Li, Xiaojia Liu, Rongbin Li, Ziyan Huang, Jiashi Lin, Junzhi Ning, Changkai Ji, Siqi Luo, Wenjie Li, Chenglong Ma, Ming Hu, Jing Xiong, Jin Ye, Bin Fu, Ningsheng Xu, Yirong Chen, Lei Jin, Hong Chen, and Junjun He

TL;DR
MedProbeBench is a new benchmark designed to evaluate large language models' ability to perform multi-step evidence integration and expert-level reasoning in medical guideline development.
Contribution
The paper introduces MedProbeBench, the first benchmark to use high-quality clinical guidelines as expert-level references for evaluating deep evidence integration in medical AI systems.
Findings
Current models show significant gaps in evidence integration and guideline generation.
Evaluation reveals the need for improved reasoning and verification capabilities in medical AI.
Together with its MedProbe-Eval evaluation framework, MedProbeBench offers a comprehensive way to assess expert-level medical reasoning.
Abstract
Recent advances in deep research systems enable large language models to retrieve, synthesize, and reason over large-scale external knowledge. In medicine, developing clinical guidelines critically depends on such deep evidence integration. However, existing benchmarks fail to evaluate this capability in realistic workflows requiring multi-step evidence integration and expert-level judgment. To address this gap, we introduce MedProbeBench, the first benchmark leveraging high-quality clinical guidelines as expert-level references. Medical guidelines, with their rigorous standards in neutrality and verifiability, represent the pinnacle of medical expertise and pose substantial challenges for deep research agents. For evaluation, we propose MedProbe-Eval, a comprehensive evaluation framework featuring: (1) Holistic Rubrics with 1,200+ task-adaptive rubric criteria for comprehensive quality…
