Large Language Models in the Clinic: A Comprehensive Benchmark

Fenglin Liu; Zheng Li; Hongjian Zhou; Qingyu Yin; Jingfeng Yang; Xianfeng Tang; Chen Luo; Ming Zeng; Haoming Jiang; Yifan Gao; Priyanka Nigam; Sreyashi Nag; Bing Yin; Yining Hua; Xuan Zhou; Omid Rohanian; Anshul Thakur; Lei Clifton; David A. Clifton

arXiv:2405.00716 · cs.CL · October 17, 2024 · 6 cites

TL;DR

This paper introduces ClinicBench, a comprehensive benchmark for evaluating large language models in clinical settings across diverse tasks, including open-ended questions and complex decision-making, with extensive evaluation and expert assessment.

Contribution

The paper constructs a new benchmark with diverse datasets and tasks, including novel complex clinical scenarios, to evaluate LLMs' clinical usefulness comprehensively.

Findings

1. Twenty-two LLMs were evaluated under both zero-shot and few-shot settings.
2. Clinical usefulness was assessed by medical experts.
3. The benchmark data is publicly available.

Abstract

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the closed-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark, ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs. The…

Code & Models

Repositories

ai-in-health/clinicbench (Official)

Taxonomy

Topics: Artificial Intelligence in Healthcare · Artificial Intelligence in Healthcare and Education