Should We Still Pretrain Encoders with Masked Language Modeling?

Hippolyte Gisserot-Boukhlef; Nicolas Boizard; Manuel Faysse; Duarte M. Alves; Emmanuel Malherbe; André F. T. Martins; Céline Hudelot; Pierre Colombo

arXiv:2507.00994·cs.CL·May 6, 2026

2 Repos 50 Models 1 Video

TL;DR

This paper investigates whether Masked Language Modeling (MLM) or Causal Language Modeling (CLM) is more effective for pretraining encoders, and finds that a biphasic schedule (CLM first, then MLM) gives the best performance under a fixed compute budget.

Contribution

The paper provides a large-scale, controlled comparison of MLM and CLM pretraining (38 models from 210 million to 1 billion parameters, over 15,000 fine-tuning and evaluation runs) and introduces a biphasic training strategy that applies CLM and then MLM.

Findings

01. MLM generally outperforms CLM on text representation tasks.

02. CLM-pretrained models are more data-efficient and more stable during fine-tuning.

03. Sequential CLM-then-MLM training yields the best performance under a fixed compute budget.
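
The sequential CLM-then-MLM idea can be illustrated with a minimal sketch. This is not the authors' code: the function names and the `clm_fraction` knob are hypothetical, and the split used in the paper is not stated on this page. The sketch assumes the standard convention that CLM uses a causal (lower-triangular) attention mask while MLM uses full bidirectional attention.

```python
from typing import List


def biphasic_schedule(total_steps: int, clm_fraction: float) -> List[str]:
    """Assign an objective to every step of a fixed budget: CLM first, then MLM.

    `clm_fraction` is a hypothetical knob for illustration, not a value
    taken from the paper.
    """
    if total_steps < 0:
        raise ValueError("total_steps must be non-negative")
    if not 0.0 <= clm_fraction <= 1.0:
        raise ValueError("clm_fraction must be in [0, 1]")
    clm_steps = round(total_steps * clm_fraction)
    return ["clm"] * clm_steps + ["mlm"] * (total_steps - clm_steps)


def attention_mask(seq_len: int, objective: str) -> List[List[int]]:
    """Return a seq_len x seq_len mask; 1 means row i may attend to column j."""
    if objective == "clm":
        # Causal: each position sees itself and earlier positions only.
        return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]
    if objective == "mlm":
        # Bidirectional: every position sees every other position.
        return [[1] * seq_len for _ in range(seq_len)]
    raise ValueError(f"unknown objective: {objective!r}")


if __name__ == "__main__":
    schedule = biphasic_schedule(total_steps=8, clm_fraction=0.25)
    print(schedule)
    print(attention_mask(3, schedule[0]))   # mask used in the first phase
    print(attention_mask(3, schedule[-1]))  # mask used in the second phase
```

The total number of steps is the same whatever the split, which is what makes the comparison a fixed-compute one: only the point at which the objective switches changes.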

Abstract

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 38 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text…

Code & Models

Videos

Should We Still Pretrain Encoders with Masked Language Modeling?· slideslive