The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models

Daniel P. Jeong; Pranav Mani; Saurabh Garg; Zachary C. Lipton; Michael Oberst

arXiv:2411.08870·cs.CL·July 1, 2025

TL;DR

This study critically evaluates whether domain-specific pretraining of large language and vision-language models genuinely enhances medical question-answering performance, finding limited or inconsistent improvements over base models.

Contribution

The paper compares ten medical LLMs and two medical VLMs head-to-head against their corresponding base models, and finds that domain-adaptive pretraining rarely yields consistent gains on medical QA.

Findings

01 Medical models rarely outperform base models in medical QA.

02 Performance improvements are inconsistent and often statistically insignificant.

03 General-purpose models may already possess substantial medical knowledge.

Abstract

Several recent works seek to adapt general-purpose large language models (LLMs) and vision-language models (VLMs) for medical applications through continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining improves performance on various downstream medical tasks, such as answering medical exam questions. In this paper, we compare ten "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting and supervised fine-tuning regimes for medical question answering (QA). For instance, on clinical-note-based QA tasks in the 3-shot setting, medical LLMs outperform their base models in only 26.7% of cases, reach a (statistical) tie in 16.7% of cases, and…
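The win/tie/loss percentages quoted in the abstract come from comparing each medical model with its base model and treating statistically indistinguishable results as ties. The listing here does not say which test the authors use, so the following is only a minimal sketch that assumes a paired percentile bootstrap over per-question correctness. The function names `paired_bootstrap_verdict` and `win_tie_loss_rates`, the 95% interval, and the resample count are illustrative assumptions, not the paper's method.

```python
import random


def paired_bootstrap_verdict(medical_correct, base_correct,
                             n_resamples=2000, alpha=0.05, seed=0):
    """Classify one medical-vs-base comparison as 'win', 'tie', or 'loss'.

    Inputs are equal-length lists of 0/1 correctness flags on the same
    questions. ASSUMPTION: a percentile bootstrap on the paired accuracy
    difference; the paper may use a different procedure.
    """
    if len(medical_correct) != len(base_correct):
        raise ValueError("both models must be scored on the same questions")
    rng = random.Random(seed)
    diffs = [m - b for m, b in zip(medical_correct, base_correct)]
    n = len(diffs)

    # Resample questions with replacement and record the mean difference.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    # Percentile confidence interval for the accuracy difference.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]

    if lower > 0:
        return "win"    # medical model is significantly better
    if upper < 0:
        return "loss"   # medical model is significantly worse
    return "tie"        # interval contains zero


def win_tie_loss_rates(verdicts):
    """Turn a list of verdicts into the fraction of wins, ties, and losses."""
    total = len(verdicts)
    return {label: verdicts.count(label) / total
            for label in ("win", "tie", "loss")}
```

Under these assumptions, one verdict would be computed per (model pair, dataset) combination, and `win_tie_loss_rates` would then summarize the verdicts into percentages of the kind reported above.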

Code & Models

Repositories

taekb/eval-medical-dapt (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Balanced Selection