Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

Harsha Nori; Yin Tat Lee; Sheng Zhang; Dean Carignan; Richard Edgar; Nicolo Fusi; Nicholas King; Jonathan Larson; Yuanzhi Li; Weishung Liu; Renqian Luo; Scott Mayer McKinney; Robert Osazuwa Ness; Hoifung Poon; Tao Qin; Naoto Usuyama; Chris White; Eric Horvitz

arXiv:2311.16452·cs.CL·November 29, 2023·166 cites

TL;DR

This study shows that, with carefully designed prompt engineering, GPT-4 can surpass specialist models on medical benchmarks and that the same prompting approach generalizes to other domains, challenging the assumption that domain-specific fine-tuning is necessary.

Contribution

The paper introduces Medprompt, a general-purpose prompt engineering strategy that enables GPT-4 to achieve state-of-the-art results on medical and other domain-specific benchmarks without domain-specific training.

Findings

01. GPT-4 with Medprompt outperforms specialist models such as Med-PaLM 2.

02. Medprompt reduces the error rate on MedQA by 27% relative to the best prior results achieved with specialist models.

03. The approach generalizes well across multiple domains beyond medicine.
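
The 27% figure in the findings is a relative reduction in error rate, not a 27-point gain in accuracy. The sketch below shows how such a figure is computed from two accuracy scores. The function name and the two accuracy values are assumptions chosen for illustration; they are not taken from this page.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fractional drop in error rate when accuracy (in percent)
    moves from baseline_acc to new_acc."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    if baseline_err <= 0:
        raise ValueError("baseline accuracy must be below 100%")
    return (baseline_err - new_err) / baseline_err


# Illustrative accuracies in percent (assumed for this sketch,
# not reported on this page).
baseline_accuracy = 86.5
new_accuracy = 90.2

reduction = relative_error_reduction(baseline_accuracy, new_accuracy)
print(f"{reduction:.1%}")  # prints 27.4%
```

With these assumed numbers, a gain of under four accuracy points corresponds to removing roughly a quarter of the remaining errors, which is why error-rate reduction is the more informative figure near the top of a benchmark.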

Abstract

Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain…

Code & Models

Videos

Keynote: Multimodal Generative AI for Precision Health | Microsoft Research Forum · YouTube

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Linear Layer · Position-Wise Feed-Forward Layer · Absolute Position Encodings