Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?

Niclas Doll; Jasper Schulze Buschhoff; Shalaka Satheesh; Hammam Abdelwahab; Héctor Allende-Cid; Katrin Klug

arXiv:2604.19394·cs.CL·April 22, 2026

TL;DR

This study demonstrates that continual pre-training and merging of language models with domain-specific data significantly improve their performance in German medical tasks, making smaller models more competitive.

Contribution

The paper introduces a domain-adaptation methodology for LLMs in German medical language: it constructs a high-quality corpus (FineMed-de), then continually pre-trains and merges models to strengthen specialization.

Findings

01. Specialized 7B models outperform general-purpose models on German medical benchmarks.
02. Domain adaptation raises the win-rate of the Qwen2.5-based models against the much larger Mistral-Small-24B-Instruct approximately 3.5-fold.
03. Model merging restores instruction-following but introduces language mixing and verbosity.

Abstract

This paper narrows the performance gap between small, specialized models and significantly larger general-purpose models through domain adaptation via continual pre-training and merging. We address the scarcity of specialized non-English data by constructing a high-quality German medical corpus (FineMed-de) from FineWeb2. This corpus is used to continually pre-train and merge three well-known LLMs (ranging from 7B to 24B parameters), creating the DeFineMed model family. A comprehensive evaluation confirms that specialization dramatically enhances 7B model performance on German medical benchmarks. Furthermore, the pairwise win-rate analysis of the Qwen2.5-based models demonstrates an approximately 3.5-fold increase in the win-rate against the much larger Mistral-Small-24B-Instruct through domain adaptation. This evidence positions specialized 7B models as a competitive,…
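To make the "3.5-fold increase in the win-rate" concrete, the following is a minimal sketch of how a pairwise win-rate and its fold change could be computed. It is not the paper's evaluation code: the function name `win_rate`, the verdict labels (`"small"`, `"large"`, `"tie"`), the choice to count a tie as half a win, and all verdict counts are assumptions made for illustration. The counts are invented so that the ratio comes out near 3.5; they are not results from the paper.

```python
from collections import Counter


def win_rate(judgments, model):
    """Share of pairwise comparisons won by `model`.

    Assumption: a "tie" verdict counts as half a win. The paper's
    actual tie handling is not stated in the abstract.
    """
    if not judgments:
        raise ValueError("win_rate needs at least one judgment")
    counts = Counter(judgments)
    return (counts[model] + 0.5 * counts["tie"]) / len(judgments)


# Hypothetical judge verdicts (NOT data from the paper): each entry names
# the winner of one head-to-head comparison on the same prompt.
base_vs_large = ["large"] * 8 + ["small"] * 2      # small model before adaptation
adapted_vs_large = ["large"] * 3 + ["small"] * 7   # small model after adaptation

base = win_rate(base_vs_large, "small")
adapted = win_rate(adapted_vs_large, "small")

# Fold increase = adapted win-rate divided by the baseline win-rate.
fold_increase = adapted / base
print(f"base={base:.2f} adapted={adapted:.2f} fold={fold_increase:.2f}")
```

A fold change computed this way is sensitive to the baseline: when the unadapted win-rate is small, a modest absolute gain produces a large ratio, so the absolute win-rates are worth reading alongside it.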
