CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making
Mingyu Derek Ma, Chenchen Ye, Yu Yan, Xiaoxuan Wang, Peipei Ping, Timothy S Chang, Wei Wang

TL;DR
This paper introduces CliBench, a comprehensive benchmark built from the MIMIC IV dataset to evaluate large language models' performance on diverse, real-world clinical decision-making tasks, highlighting both their potential and their current limitations.
Contribution
We developed CliBench, a multifaceted benchmark for assessing LLMs in clinical diagnosis, covering various specialties and tasks with structured evaluation metrics.
Findings
LLMs show promise in clinical diagnosis but have notable limitations.
Structured, multi-granular evaluation reveals strengths and weaknesses of LLMs.
The benchmark facilitates realistic assessment of LLMs in healthcare applications.
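To make "multi-granular evaluation" concrete, here is a minimal sketch of scoring predicted diagnosis codes against reference codes at several levels of detail. It is an illustration only, not the paper's scoring procedure. It assumes diagnoses are hierarchical codes in the style of ICD-10 and that a coarser level can be approximated by truncating a code to its first few characters. The function names (`truncate`, `prf_at_level`) and the sample codes are hypothetical.

```python
from typing import Iterable, Tuple


def truncate(code: str, level: int) -> str:
    """Normalize a code (drop dots, uppercase) and keep its first `level` characters."""
    return code.replace(".", "").upper()[:level]


def prf_at_level(
    predicted: Iterable[str], gold: Iterable[str], level: int
) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 after truncating codes to `level` characters.

    ASSUMPTION: prefix truncation stands in for a coarser level of the code
    hierarchy; the benchmark's actual granularity levels may be defined differently.
    """
    pred = {truncate(c, level) for c in predicted}
    ref = {truncate(c, level) for c in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical model output and reference diagnoses for one admission.
    predicted = ["I10", "E11.9", "J18.9"]
    gold = ["I10", "E11.65", "N17.9"]
    for level in (3, 4, 5):
        p, r, f = prf_at_level(predicted, gold, level)
        print(f"level={level} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case the prediction `E11.9` counts as correct at the three-character level, where it shares the prefix `E11` with the reference `E11.65`, but not at finer levels. Reporting scores per level separates "right disease family, wrong specific code" from outright misses.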
Abstract
The integration of Artificial Intelligence (AI), especially Large Language Models (LLMs), into the clinical diagnosis process offers significant potential to improve the efficiency and accessibility of medical care. While LLMs have shown some promise in the medical domain, their application in clinical diagnosis remains underexplored, especially in real-world clinical practice, where highly sophisticated, patient-specific decisions need to be made. Current evaluations of LLMs in this field are often narrow in scope, focusing on specific diseases or specialties and employing simplified diagnostic tasks. To bridge this gap, we introduce CliBench, a novel benchmark developed from the MIMIC IV dataset, offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. This benchmark not only covers diagnoses from a diverse range of medical cases across various…
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education
