TL;DR
When adapting an encoder to the biomedical domain, temporarily switching from Masked Language Modeling (MLM) to Causal Language Modeling (CLM), then returning to MLM for a short decay phase, improves downstream performance over continued MLM training alone.
Contribution
The study shows that a CLM detour during continued encoder pretraining improves biomedical task results, and it analyzes the representational changes behind the gains.
Findings
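The CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size.
CLM's dense supervision changes low transformer layers (0-7) far more than MLM does.
Freezing low layers during the CLM phase eliminates the downstream benefit, while freezing mid layers preserves it.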
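The freezing ablation in the last finding can be illustrated with a small sketch. This is not the authors' code: the parameter naming scheme (`layers.<index>.`) and the example parameter names are hypothetical, and only the low-layer range 0-7 is taken from the abstract. The sketch decides, by name, which parameters would stay trainable during the CLM phase.

```python
import re

# Layers 0-7 are the "low" layers named in the abstract.
LOW_LAYERS = range(0, 8)

def layer_index(param_name):
    """Return the layer index encoded in a parameter name, or None.

    Assumes a hypothetical naming scheme of the form 'layers.<index>.<rest>'.
    """
    match = re.search(r"(?:^|\.)layers\.(\d+)\.", param_name)
    return int(match.group(1)) if match else None

def trainable_during_clm(param_name, frozen_layers):
    """True if the parameter should keep receiving gradients in the CLM phase."""
    idx = layer_index(param_name)
    # Parameters outside any transformer layer (embeddings, final norm) stay trainable.
    return idx is None or idx not in frozen_layers

# Hypothetical parameter names, for illustration only.
names = [
    "embeddings.tok_embeddings.weight",
    "layers.0.attn.Wqkv.weight",
    "layers.7.mlp.Wo.weight",
    "layers.8.attn.Wqkv.weight",
    "final_norm.weight",
]

frozen = [n for n in names if not trainable_during_clm(n, LOW_LAYERS)]
print(frozen)
```

In a PyTorch training loop, the same predicate could be used to set `requires_grad = False` on the selected parameters before the CLM phase and restore it afterwards; the mid-layer variant would only change the range passed as `frozen_layers`, whose exact bounds are not given here.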
Abstract
When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM's dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release…
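To make "dense supervision" and the two-phase schedule concrete, here is a minimal sketch under stated assumptions. The step counts and the 30% masking rate are illustrative values, not figures from the paper, and the function names are hypothetical. It contrasts the attention pattern and the number of supervised positions per sequence under each objective, then lays out a CLM phase followed by an MLM decay phase.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

def supervised_positions(seq_len, objective, mask_prob=0.3):
    """Number of positions that contribute to the loss for one sequence.

    mask_prob is an assumed masking rate, used only for illustration.
    """
    if objective == "clm":
        # Every position except the last predicts its successor.
        return seq_len - 1
    if objective == "mlm":
        # Only the masked positions are predicted.
        return round(seq_len * mask_prob)
    raise ValueError(f"unknown objective: {objective}")

def objective_at(step, clm_steps, decay_steps):
    """Objective used at a given step: CLM first, then the MLM decay phase."""
    if not 0 <= step < clm_steps + decay_steps:
        raise ValueError("step outside the schedule")
    return "clm" if step < clm_steps else "mlm"

# Toy schedule: 3 CLM steps followed by 2 MLM decay steps (assumed lengths).
schedule = [objective_at(s, clm_steps=3, decay_steps=2) for s in range(5)]
print(schedule)
print(supervised_positions(10, "clm"), supervised_positions(10, "mlm"))
```

With these assumed values, a 10-token sequence yields 9 supervised positions under CLM against 3 under MLM, which is the sense in which CLM supervision is denser.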
