A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations
Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu, Xi Wang, Yingcong Chen, Can Yang, Shujie Liu, Hao Chen

TL;DR
This paper introduces CPGBench, a benchmark for evaluating LLMs' ability to detect and adhere to clinical practice guidelines in multi-turn medical conversations, revealing significant gaps in current models' performance.
Contribution
The paper presents the first systematic benchmark, CPGBench, assessing LLMs' clinical guideline detection and adherence, with extensive data and human validation.
Findings
Detection accuracy ranges from 71.1% to 89.6%.
Adherence rates vary from 21.8% to 63.2%.
Large gaps exist between knowing and applying guidelines.
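As a rough illustration of how figures like these could be tallied, the sketch below computes a per-model detection accuracy, adherence rate, and the gap between them from per-conversation judgments. The `Judgment` schema, the `rates` helper, and the toy data are all hypothetical and are not taken from the paper; it also assumes that detection and adherence are judged on the same conversations, which the summary here does not confirm.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One judged conversation (hypothetical schema, not from the paper)."""
    model: str
    detected: bool  # the model identified the relevant recommendation
    adhered: bool   # the model's reply followed the recommendation


def rates(judgments):
    """Return {model: (detection %, adherence %, gap in percentage points)}."""
    by_model = {}
    for j in judgments:
        by_model.setdefault(j.model, []).append(j)

    out = {}
    for model, js in by_model.items():
        n = len(js)
        det = 100.0 * sum(j.detected for j in js) / n
        adh = 100.0 * sum(j.adhered for j in js) / n
        out[model] = (det, adh, det - adh)
    return out


# Toy data for illustration only; these are not results from the paper.
toy = [
    Judgment("model_a", True, True),
    Judgment("model_a", True, False),
    Judgment("model_a", True, False),
    Judgment("model_a", False, False),
]
summary = rates(toy)
```

Reporting the gap in percentage points per model, rather than comparing the two ranges directly, keeps the "knowing versus applying" comparison tied to a single model at a time.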
Abstract
Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents, published in the last decade by 9 countries/regions and 2 international organizations and spanning 24 specialties. From these documents, we extract 32,155 clinical recommendations with corresponding publication institute, date, country, specialty, recommendation strength, evidence level, etc. One multi-turn conversation is generated for each recommendation accordingly to…
Taxonomy
Topics: Clinical practice guidelines implementation · Electronic Health Records Systems · Artificial Intelligence in Healthcare and Education
