mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

Marc Marone; Orion Weller; William Fleshman; Eugene Yang; Dawn Lawrie; Benjamin Van Durme

arXiv:2509.06888·cs.CL·September 9, 2025·2 cites

TL;DR

mmBERT is a multilingual encoder model trained on over 1800 languages, introducing novel training strategies that significantly improve performance on classification and retrieval tasks across resource levels.

Contribution

The paper presents mmBERT, a new multilingual encoder with innovative training techniques, including inverse mask scheduling and phased low-resource language inclusion, enhancing performance across diverse languages.
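As a rough illustration of what a phase-dependent mask schedule could look like, the sketch below lowers the masking probability from one training phase to the next. It assumes that "inverse" means the ratio decreases over training. The phase names, the ratio values, and the `mask_tokens` helper are hypothetical and are not taken from the summary above. It also simplifies masking to plain replacement with a mask id, without the random-token or keep-token variants used in BERT-style recipes.

```python
import random

# Hypothetical per-phase mask ratios (assumed values, for illustration only).
MASK_RATIO_BY_PHASE = {
    "pretraining": 0.30,
    "mid_training": 0.15,
    "decay": 0.05,
}

IGNORE_LABEL = -100  # label for positions that contribute no loss


def mask_tokens(token_ids, phase, mask_id, rng):
    """Mask each token independently with the probability set for `phase`.

    Returns (masked_ids, labels): labels hold the original token at masked
    positions and IGNORE_LABEL everywhere else.
    """
    ratio = MASK_RATIO_BY_PHASE[phase]
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < ratio:
            masked.append(mask_id)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(IGNORE_LABEL)
    return masked, labels
```

With a schedule like this, early phases see many masked positions per sequence, while the final phase masks only a few, so the model predicts each masked token from nearly complete context.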

Findings

01

mmBERT achieves classification performance similar to models like OpenAI's o3 and Google's Gemini 2.5 Pro, despite including low-resource languages only in the short decay phase.

02

Adding over 1700 low-resource languages to the data mix only during the decay phase boosts model performance.

03

The model outperforms previous models on classification and retrieval tasks for both high and low-resource languages.

Abstract

Encoder-only language models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, there has been a lack of recent research for encoder models, especially with respect to multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several novel elements, including an inverse mask ratio schedule and an inverse temperature sampling ratio. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that it boosts performance dramatically and maximizes the gains from the relatively small amount of training data. Despite only including these low-resource languages in the short decay phase we achieve similar classification performance to models like OpenAI's o3 and Google's Gemini 2.5…
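To make the idea of temperature-based language sampling concrete, here is a minimal sketch. It assumes the common exponentiated-count form, where a language with n tokens is sampled with probability proportional to n^τ. The abstract does not give the formula, so this form, the `sampling_probs` helper, and the token counts are assumptions for illustration only.

```python
def sampling_probs(token_counts, tau):
    """Return per-language sampling probabilities proportional to count ** tau.

    tau = 1 samples in proportion to data size; tau = 0 samples uniformly.
    Values in between trade off the two.
    """
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Hypothetical corpus sizes (tokens) for one high- and one low-resource language.
counts = {"high_resource": 1_000_000, "low_resource": 1_000}

# Lowering tau across training phases shifts probability mass toward
# the low-resource language (the tau values here are assumed).
for tau in (0.7, 0.5, 0.3):
    probs = sampling_probs(counts, tau)
    print(tau, round(probs["low_resource"], 4))
```

Under this assumed form, annealing τ downward moves the mix from size-proportional toward uniform, so languages with little data are sampled relatively more often late in training.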

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

• Large-scale experiments: 3T tokens with 1800 languages.
• A 3-phase recipe for training mmBERT: (1) inverse mask schedule; (2) annealing language schedule; (3) gradually increasing the number of languages over the 3 phases.

Weaknesses

• Mostly known tricks; the contribution lies in locating a desirable combination of them.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Trained to support a large set of languages (1800+)
- Outperforms previous encoder models (XLM-R, EuroBERT)
- Method and training strategies are clearly described
- The model and data are claimed to be fully open-sourced
- Strong multilingual encoder model released as a backbone to replace XLM-R etc.

Weaknesses

1. The effectiveness of the inverse masking schedule and cascading annealed language learning schedule are not well demonstrated through ablation studies. If these methods are claimed as novel contributions, it's important to thoroughly prove their effectiveness.
2. It is unclear how mmBERT compares to using a small decoder-only LLM like Qwen3-0.6B as a backbone for text embedding and classification. (If I recall correctly, decoder-only LLMs can enable bidirectional attention during downstream

Reviewer 03 · Rating 8 · Confidence 3

Strengths

The annealed language learning is well-motivated and seems to be useful. In general, the large number of languages covered is a nice strength of the paper. mmBERT shows strong empirical results across diverse benchmarks. Overall, this is a meaningful contribution to an underserved area (multilingual encoders).

Weaknesses

The authors acknowledge they couldn't ablate inverse masking due to compute, but this is an interesting contribution that could be tested more directly. Only 2 languages are tested for demonstrating the benefit of the annealed language learning contribution. In general, with many languages covered it is difficult to assess whether the encoder really does a decent job on all languages and the evaluation over the long tail of languages is very superficial. The tables of results do not report var

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques