Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?
Niclas Doll, Jasper Schulze Buschhoff, Shalaka Satheesh, Hammam Abdelwahab, Héctor Allende-Cid, Katrin Klug

TL;DR
This study demonstrates that continual pre-training and merging of language models with domain-specific data significantly improve their performance in German medical tasks, making smaller models more competitive.
Contribution
It introduces a new methodology for domain adaptation of LLMs in German medical language, including creating a high-quality corpus and merging models to enhance specialization.
Findings
Specialized 7B models outperform general-purpose models on German medical benchmarks.
Domain adaptation increases the win-rate of the smaller Qwen2.5-based models against the much larger Mistral-Small-24B-Instruct by approximately 3.5 times.
Model merging restores instruction-following but introduces language mixing and verbosity.
Abstract
This paper narrows the performance gap between small, specialized models and significantly larger general-purpose models through domain adaptation via continual pre-training and merging. We address the scarcity of specialized non-English data by constructing a high-quality German medical corpus (FineMed-de) from FineWeb2. This corpus is used to continually pre-train and merge three well-known LLMs of varying parameter counts, creating the DeFineMed model family. A comprehensive evaluation confirms that specialization dramatically enhances model performance on German medical benchmarks. Furthermore, the pairwise win-rate analysis of the Qwen2.5-based models demonstrates an approximately 3.5-fold increase in the win-rate against the much larger Mistral-Small-24B-Instruct through domain adaptation. This evidence positions specialized models as a competitive,…
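The fold increase in pairwise win-rate mentioned in the abstract can be illustrated with a small sketch. This is not the authors' evaluation code. The function names, the convention of counting a tie as half a win, and the toy outcome lists are all assumptions made for illustration, and the numbers are invented rather than taken from the paper.

```python
from collections import Counter


def win_rate(outcomes):
    """Fraction of pairwise comparisons won by the smaller model.

    `outcomes` is a list of "win", "loss", or "tie" labels, one per
    comparison against the larger reference model. Counting a tie as
    half a win is an assumed convention, not one stated in the paper.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons given")
    return (counts["win"] + 0.5 * counts["tie"]) / total


def fold_increase(base_outcomes, adapted_outcomes):
    """Ratio of the adapted model's win-rate to the base model's win-rate."""
    base = win_rate(base_outcomes)
    if base == 0:
        raise ValueError("base win-rate is zero; fold increase is undefined")
    return win_rate(adapted_outcomes) / base


# Toy, made-up outcomes (NOT results from the paper).
base_outcomes = ["win"] * 4 + ["loss"] * 16      # win-rate 0.2
adapted_outcomes = ["win"] * 10 + ["loss"] * 10  # win-rate 0.5

print(f"fold increase: {fold_increase(base_outcomes, adapted_outcomes):.2f}")
```

A fold increase is a ratio of two win-rates, so it says nothing about the absolute level: a model can multiply its win-rate several times over and still lose most comparisons.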
