Large Language Models Encode Clinical Knowledge
Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathanael Scharli, Aakanksha Chowdhery, Philip Mansfield

TL;DR
This paper introduces the MultiMedQA benchmark and the HealthSearchQA dataset to evaluate clinical knowledge in large language models. It shows that instruction prompt tuning improves medical reasoning, though the resulting models still fall short of clinician performance.
Contribution
It presents new benchmarks for clinical question answering and a parameter-efficient instruction prompt tuning method to enhance LLMs' medical understanding.
Findings
Flan-PaLM achieves state-of-the-art accuracy on multiple medical QA datasets.
Instruction prompt tuning improves medical reasoning and recall.
Models still lag behind clinicians in understanding and reasoning.
Abstract
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant,…
Code & Models
- 🤗 aaditya/Llama3-OpenBioLLM-8B · model · 39k dl · ♡ 236
- 🤗 aaditya/Llama3-OpenBioLLM-70B · model · 3.6k dl · ♡ 503
- 🤗 LiteLLMs/Llama3-OpenBioLLM-8B-GGUF · model · 34 dl · ♡ 1
- 🤗 disi-unibo-nlp/MedGENIE-fid-flan-t5-base-medqa · model · 5 dl
- 🤗 matteocap/OpenBioLLM-Llama3-8B_safetensors · model · 1 dl
- 🤗 LoneStriker/OpenBioLLM-Llama3-8B-GGUF · model · 35 dl · ♡ 1
- 🤗 LoneStriker/OpenBioLLM-Llama3-8B-3.0bpw-h6-exl2 · model · 2 dl
- 🤗 LoneStriker/OpenBioLLM-Llama3-8B-4.0bpw-h6-exl2 · model · 1 dl
- 🤗 LoneStriker/OpenBioLLM-Llama3-8B-5.0bpw-h6-exl2 · model · 2 dl
- 🤗 LoneStriker/OpenBioLLM-Llama3-8B-6.0bpw-h6-exl2 · model · 2 dl
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Text Readability and Simplification
Methods: Pathways Language Model
