Large Language Models Encode Clinical Knowledge

Karan Singhal; Shekoofeh Azizi; Tao Tu; S. Sara Mahdavi; Jason Wei; Hyung Won Chung; Nathan Scales; Ajay Tanwani; Heather Cole-Lewis; Stephen Pfohl; Perry Payne; Martin Seneviratne; Paul Gamble; Chris Kelly; Nathaneal Scharli; Aakanksha Chowdhery; Philip Mansfield; Blaise Aguera y Arcas; Dale Webster; Greg S. Corrado; Yossi Matias; Katherine Chou; Juraj Gottweis; Nenad Tomasev; Yun Liu; Alvin Rajkomar; Joelle Barral; Christopher Semturs; Alan Karthikesalingam; Vivek Natarajan

arXiv:2212.13138·cs.CL·December 27, 2022·258 cites

Open Access 1 Repo 10 Models 4 Datasets

TL;DR

This paper introduces MultiMedQA, a benchmark that combines six existing medical question answering datasets with the new HealthSearchQA dataset, to evaluate clinical knowledge in large language models. It shows that instruction prompt tuning improves medical reasoning, but that the resulting models still fall short of clinician performance.

Contribution

The paper presents MultiMedQA and HealthSearchQA for clinical question answering, along with instruction prompt tuning, a parameter-efficient method for improving LLMs' medical understanding.

Findings

01. Flan-PaLM achieves state-of-the-art accuracy on multiple medical QA datasets.

02. Instruction prompt tuning improves medical reasoning and recall.

03. Models still lag behind clinicians in understanding and reasoning.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant,…

Code & Models

Repositories

dmis-lab/olaph (PyTorch)

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Text Readability and Simplification

Methods: Pathways Language Model