To Adapt or not to Adapt, Rethinking the Value of Medical Knowledge-Aware Large Language Models
Ane G. Domingo-Aldama, Iker De La Iglesia, Maitane Urruela, Aitziber Atutxa, Ander Barrena

TL;DR
This study compares general and clinical large language models on medical question answering tasks, revealing limited benefits of clinical adaptation in English but notable improvements in Spanish with lightweight models.
Contribution
It introduces Marmoka, a family of lightweight clinical LLMs for Spanish, and proposes a perturbation-based benchmark to evaluate model robustness and instruction following.
Findings
Clinical LLMs do not consistently outperform general models in English.
Marmoka models outperform Llama in Spanish clinical tasks.
Both model types show limitations in instruction following and output formatting.
Abstract
BACKGROUND: Recent studies have shown that domain-adapted large language models (LLMs) do not consistently outperform general-purpose counterparts on standard medical benchmarks, raising questions about the need for specialized clinical adaptation. METHODS: We systematically compare general and clinical LLMs on a diverse set of multiple-choice clinical question answering tasks in English and Spanish. We introduce a perturbation-based evaluation benchmark that probes model robustness, instruction following, and sensitivity to adversarial variations. Our evaluation includes one-step and two-step question transformations, multi-prompt testing, and instruction-guided assessment. We analyze a range of state-of-the-art clinical models and their general-purpose counterparts, focusing on Llama 3.1-based models. Additionally, we introduce Marmoka, a family of lightweight 8B-parameter clinical…
