mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
Marc Marone, Orion Weller, William Fleshman, Eugene Yang, Dawn Lawrie, Benjamin Van Durme

TL;DR
mmBERT is a multilingual encoder model trained on over 1800 languages, introducing novel training strategies that significantly improve performance on classification and retrieval tasks across resource levels.
Contribution
The paper presents mmBERT, a new multilingual encoder with innovative training techniques, including inverse mask scheduling and phased low-resource language inclusion, enhancing performance across diverse languages.
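As a rough illustration of what a phase-dependent mask schedule might look like, the sketch below reads "inverse mask scheduling" as a masking ratio that drops as training moves through its phases. That reading, the phase boundaries, the ratios, and the helper names `mask_ratio` and `mask_tokens` are all assumptions made for illustration; none of them are taken from this page.

```python
import random

# Hypothetical schedule: (phase name, start as a fraction of training, mask ratio).
# The boundaries and ratios are placeholders, not values reported on this page.
PHASES = [
    ("pretraining", 0.00, 0.30),
    ("mid-training", 0.80, 0.15),
    ("decay", 0.95, 0.05),
]

def mask_ratio(step, total_steps):
    """Return the mask ratio of the latest phase whose start has been reached."""
    progress = step / total_steps
    ratio = PHASES[0][2]
    for _name, start, phase_ratio in PHASES:
        if progress >= start:
            ratio = phase_ratio
    return ratio

def mask_tokens(token_ids, ratio, mask_id, rng):
    """Replace a random `ratio` share of tokens with `mask_id` (at least one)."""
    n_mask = max(1, round(len(token_ids) * ratio))
    positions = set(rng.sample(range(len(token_ids)), n_mask))
    masked = [mask_id if i in positions else tok for i, tok in enumerate(token_ids)]
    return masked, sorted(positions)
```

A training loop would call `mask_ratio(step, total_steps)` once per batch and pass the result to `mask_tokens`, so that later phases hide fewer tokens per sequence.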
Findings
mmBERT achieves classification performance similar to models like OpenAI's o3 and Google's Gemini 2.5 Pro.
Including low-resource languages only during the decay phase boosts model performance.
The model outperforms previous models on classification and retrieval tasks for both high and low-resource languages.
Abstract
Encoder-only language models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, there has been a lack of recent research for encoder models, especially with respect to multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several novel elements, including an inverse mask ratio schedule and an inverse temperature sampling ratio. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that it boosts performance dramatically and maximizes the gains from the relatively small amount of training data. Despite only including these low-resource languages in the short decay phase we achieve similar classification performance to models like OpenAI's o3 and Google's Gemini 2.5…
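Temperature-based language sampling is commonly written as p_i ∝ n_i^τ, where n_i is the amount of data for language i and τ is an exponent between 0 and 1. The sketch below assumes that form and shows how lowering τ shifts probability mass toward smaller languages. The token counts, the τ values, and the function name `sampling_probs` are hypothetical; the abstract does not give the exact formula or settings.

```python
def sampling_probs(token_counts, tau):
    """Exponentiated sampling: p_i is proportional to count_i ** tau."""
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Hypothetical corpus sizes in tokens (illustrative only).
counts = {"en": 1_000_000, "sw": 10_000, "fo": 100}

# Hypothetical annealing of tau across three training phases.
for tau in (0.7, 0.5, 0.3):
    probs = sampling_probs(counts, tau)
    print(tau, {lang: round(p, 3) for lang, p in probs.items()})
```

With τ = 1 the mix would follow raw data size, and with τ = 0 every language would be sampled equally; lowering τ over training moves the mix toward the uniform end.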
Peer Reviews
Decision · Submitted to ICLR 2026
• Large-scale experiments: 3T tokens with 1800 languages
• A 3-phase recipe for training mmBERT: (1) inverse mask schedule; (2) annealing language schedule; (3) gradually increasing the number of languages across the 3 phases.
• Mostly known tricks; the work consists of locating a desired combination of them.
- Trained to support a large set of languages (1800+)
- Outperforms previous encoder models (XLM-R, EuroBERT)
- Method and training strategies are clearly described
- The model and data are claimed to be fully open-sourced
- Strong multilingual encoder model released as a backbone replacing XLM-R etc.
1. The effectiveness of the inverse masking schedule and the cascading annealed language learning schedule is not well demonstrated through ablation studies. If these methods are claimed as novel contributions, it is important to thoroughly prove their effectiveness.
2. It is unclear how mmBERT compares to using a small decoder-only LLM like Qwen3-0.6B as a backbone for text embedding and classification. (If I recall correctly, decoder-only LLMs can enable bidirectional attention during downstream
The annealed language learning is well-motivated and seems to be useful. In general, the large number of languages covered is a nice strength of the paper. mmBERT shows strong empirical results across diverse benchmarks. Overall, this is a meaningful contribution to an underserved area (multilingual encoders).
The authors acknowledge they couldn't ablate inverse masking due to compute, but this is an interesting contribution that could be tested more directly. Only 2 languages are tested for demonstrating the benefit of the annealed language learning contribution. In general, with many languages covered it is difficult to assess whether the encoder really does a decent job on all languages and the evaluation over the long tail of languages is very superficial. The tables of results do not report var
Code & Models
- 🤗 jhu-clsp/mmBERT-base · model · 315k dl · ♡ 199
- 🤗 jhu-clsp/mmBERT-small · model · 16k dl · ♡ 67
- 🤗 jhu-clsp/mmBERT-checkpoints · model · ♡ 4
- 🤗 UWV/wimbert-synth-v0 · model · 5 dl
- 🤗 onnx-community/mmBERT-small-ONNX · model · 4 dl · ♡ 2
- 🤗 mykor/mmBERT-base-GGUF · model · 99 dl
- 🤗 mykor/mmBERT-small-GGUF · model · 243 dl
- 🤗 onnx-community/mmBERT-base-ONNX · model · 2 dl
- 🤗 LocalDoc/mmBERT-base-en-az · model · 24 dl
- 🤗 LocalDoc/mmBERT-small-en-az · model · 54 dl
- orionweller/mmBERT-pretraining-data-chunk0 · dataset · 1.6k dl
- orionweller/mmBERT-pretraining-data-chunk1 · dataset · 1.2k dl
- orionweller/mmBERT-pretraining-data-chunk2 · dataset · 284 dl
- orionweller/mmBERT-pretraining-data-chunk3 · dataset · 1.8k dl
- orionweller/mmBERT-data-decay-all · dataset · 639 dl
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
