CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making

Mingyu Derek Ma; Chenchen Ye; Yu Yan; Xiaoxuan Wang; Peipei Ping; Timothy S Chang; Wei Wang

arXiv:2406.09923 · cs.CL · October 15, 2024 · 2 cites

Open Access

TL;DR

This paper introduces CliBench, a comprehensive benchmark built from the MIMIC IV dataset to evaluate large language models on diverse, real-world clinical decision-making tasks, highlighting both their potential and their current limitations.

Contribution

We developed CliBench, a multifaceted benchmark for assessing LLMs in clinical diagnosis, covering various specialties and tasks with structured evaluation metrics.

Findings

01

LLMs show promise in clinical diagnosis but have notable limitations.

02

Structured, multi-granular evaluation reveals strengths and weaknesses of LLMs.

03

The benchmark facilitates realistic assessment of LLMs in healthcare applications.
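
To make the idea of a structured, multi-granular evaluation concrete, here is a minimal sketch in Python. It assumes that "granularity" means comparing predicted and reference diagnosis codes at several prefix lengths of a hierarchical code; that reading, the function names, the prefix levels, and the example codes are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, Iterable, Set, Tuple


def truncate(codes: Iterable[str], level: int) -> Set[str]:
    """Coarsen each code to its first `level` characters, ignoring dots."""
    return {code.replace(".", "")[:level] for code in codes}


def prf_at_level(predicted: Iterable[str], gold: Iterable[str], level: int) -> Dict[str, float]:
    """Set-based precision, recall, and F1 after coarsening both code sets."""
    pred = truncate(predicted, level)
    ref = truncate(gold, level)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def multigranular_report(
    predicted: Iterable[str],
    gold: Iterable[str],
    levels: Tuple[int, ...] = (1, 3, 4),  # hypothetical granularity levels
) -> Dict[int, Dict[str, float]]:
    """Score one prediction set at every requested granularity."""
    predicted, gold = list(predicted), list(gold)
    return {level: prf_at_level(predicted, gold, level) for level in levels}


if __name__ == "__main__":
    # Illustrative codes only; not drawn from the benchmark.
    report = multigranular_report(["I50.9", "E11.9"], ["I50.2", "J18.9"])
    for level, scores in report.items():
        print(level, scores)
```

Under this assumption, a prediction that lands in the right category but misses the exact code earns credit at the coarse levels and none at the finest one, which is the kind of strength-and-weakness breakdown a single exact-match score hides.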

Abstract

The integration of Artificial Intelligence (AI), especially Large Language Models (LLMs), into the clinical diagnosis process offers significant potential to improve the efficiency and accessibility of medical care. While LLMs have shown some promise in the medical domain, their application in clinical diagnosis remains underexplored, especially in real-world clinical practice, where highly sophisticated, patient-specific decisions need to be made. Current evaluations of LLMs in this field are often narrow in scope, focusing on specific diseases or specialties and employing simplified diagnostic tasks. To bridge this gap, we introduce CliBench, a novel benchmark developed from the MIMIC IV dataset, offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. This benchmark not only covers diagnoses from a diverse range of medical cases across various…

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education