TL;DR
This paper investigates whether Masked Language Modeling (MLM) or Causal Language Modeling (CLM) is more effective for pretraining encoders, finding that a biphasic approach (CLM followed by MLM) gives the best performance under a fixed compute budget.
Contribution
It provides a comprehensive large-scale analysis comparing MLM and CLM pretraining, introducing a biphasic training strategy that leverages both objectives for improved results.
Findings
MLM generally outperforms CLM on text representation tasks.
CLM-trained models are more data-efficient and more stable during fine-tuning.
Training sequentially with CLM and then MLM yields the best performance within a fixed compute budget.
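To make the difference between the two objectives and the sequential schedule concrete, here is a minimal sketch in plain Python. It is an illustration, not the paper's training code: the names `MASK_ID`, `IGNORE`, `clm_example`, `mlm_example`, `objective_at`, and `make_example` are hypothetical, the 0.4 default masking ratio is an arbitrary placeholder, and masking is simplified to always substituting a mask token. The sketch also covers only how inputs and labels are built; the attention pattern (causal versus bidirectional) is not modeled.

```python
import random

MASK_ID = 0     # hypothetical id of the mask token (assumption)
IGNORE = -100   # label meaning "no loss here", following the common PyTorch convention


def clm_example(tokens):
    """Causal LM: each position predicts the next token."""
    return tokens[:-1], tokens[1:]


def mlm_example(tokens, mask_ratio, rng):
    """Masked LM: hide a random subset of tokens and predict only those."""
    n_mask = max(1, round(len(tokens) * mask_ratio))
    masked = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK_ID if i in masked else t for i, t in enumerate(tokens)]
    labels = [t if i in masked else IGNORE for i, t in enumerate(tokens)]
    return inputs, labels


def objective_at(step, clm_steps):
    """Biphasic schedule: CLM for the first `clm_steps` steps, MLM afterwards."""
    return "clm" if step < clm_steps else "mlm"


def make_example(tokens, step, clm_steps, mask_ratio=0.4, seed=0):
    """Build (inputs, labels) for one sequence at a given training step."""
    rng = random.Random(seed + step)  # deterministic per step, for reproducibility
    if objective_at(step, clm_steps) == "clm":
        return clm_example(tokens)
    return mlm_example(tokens, mask_ratio, rng)
```

With a fixed total number of steps, the only schedule parameter in this sketch is `clm_steps`: setting it to 0 gives pure MLM, setting it to the total gives pure CLM, and anything in between gives a CLM phase followed by an MLM phase.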
Abstract
Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 38 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text…
Code & Models
- 🤗 MLMvsCLM/210m-mlm40-42k (model, 3 downloads)
- 🤗 MLMvsCLM/610m-clm-17k-mlm40-22k (model, 53 downloads)
- 🤗 MLMvsCLM/210m-mlm20-42k (model, 53 downloads)
- 🤗 MLMvsCLM/610m-clm-40k-mlm40-42k (model, 1 download)
- 🤗 MLMvsCLM/610m-mlm40-42k-40000 (model)
- 🤗 MLMvsCLM/610m-clm-40k-mlm30-42k (model, 3 downloads)
- 🤗 MLMvsCLM/610m-clm-10k-mlm40-42k (model, 2 downloads)
- 🤗 MLMvsCLM/610m-clm-42k-1000 (model)
- 🤗 MLMvsCLM/610m-clm-dec42k-mlm40-54k (model, 1 download)
- 🤗 MLMvsCLM/610m-clm-40k-mlm50-42k (model, 1 download)