Towards Expert-Level Medical Question Answering with Large Language Models
Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska

TL;DR
Med-PaLM 2 advances medical question answering by combining an improved base language model, domain-specific fine-tuning, and novel prompting strategies, approaching or surpassing physician-level performance on multiple datasets and evaluations.
Contribution
This work introduces Med-PaLM 2, a new model that combines base LLM improvements, medical fine-tuning, and ensemble prompting to achieve state-of-the-art medical QA performance.
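To make the idea of ensembling answers concrete, here is a minimal sketch of one generic approach: sample several answers per multiple-choice question, take a majority vote, and score the result as accuracy. This is an illustration only and is not claimed to be the ensemble strategy used by Med-PaLM 2; the function names `majority_vote` and `accuracy` and the toy data are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among several sampled answers.

    Ties go to the answer that appeared first in the list.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy data (hypothetical): three sampled answers for each of two questions.
sampled = [["A", "B", "A"], ["C", "C", "D"]]
gold = ["A", "D"]

predictions = [majority_vote(s) for s in sampled]
print(predictions)                  # ['A', 'C']
print(accuracy(predictions, gold))  # 0.5
```

A benchmark figure such as a MedQA score is an accuracy of this kind: the share of multiple-choice questions for which the final selected option matches the reference answer.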
Findings
Med-PaLM 2 scored 86.5% on MedQA, surpassing the 67.2% achieved by its predecessor, Med-PaLM.
Physician raters preferred Med-PaLM 2 answers to physician-written answers on most evaluation axes.
Med-PaLM 2 showed significant improvements on adversarial and long-form medical questions.
Abstract
Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble…
Code & Models
- 🤗 aaditya/Llama3-OpenBioLLM-8B · model · 39k dl · ♡ 236
- 🤗 aaditya/Llama3-OpenBioLLM-70B · model · 3.6k dl · ♡ 503
- 🤗 LiteLLMs/Llama3-OpenBioLLM-8B-GGUF · model · 34 dl · ♡ 1
- 🤗 matteocap/OpenBioLLM-Llama3-8B_safetensors · model · 1 dl
- 🤗 LoneStriker/OpenBioLLM-Llama3-8B-GGUF · model · 35 dl · ♡ 1
- 🤗 LoneStriker/OpenBioLLM-Llama3-8B-3.0bpw-h6-exl2 · model · 2 dl
- 🤗 LoneStriker/OpenBioLLM-Llama3-8B-4.0bpw-h6-exl2 · model · 1 dl
- 🤗 LoneStriker/OpenBioLLM-Llama3-8B-5.0bpw-h6-exl2 · model · 2 dl
- 🤗 LoneStriker/OpenBioLLM-Llama3-8B-6.0bpw-h6-exl2 · model · 2 dl
- 🤗 LoneStriker/OpenBioLLM-Llama3-8B-8.0bpw-h8-exl2 · model · 2 dl
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare
Methods: Balanced Selection
