MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation
Daniel Tamayo, Iñaki Lacunza, Paula Rivera-Hidalgo, Severino Da Dalt, Javier Aula-Blasco, Aitor Gonzalez-Agirre, Marta Villegas

TL;DR
MrBERT is a multilingual encoder family that achieves state-of-the-art results on Catalan and Spanish tasks and robust performance in biomedical and legal domains, while cutting inference and storage costs through variable-size embeddings trained with Matryoshka Representation Learning.
Contribution
Introduces MrBERT, a multilingual encoder family with adaptive vocabulary, domain-specific tuning, and efficient vector representations for improved performance and deployment.
Findings
State-of-the-art results on Catalan and Spanish tasks
Robust performance in biomedical and legal domains
Reduced inference and storage costs
Abstract
We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code. Through targeted adaptation, this model family achieves state-of-the-art results on Catalan- and Spanish-specific tasks, while establishing robust performance across specialized biomedical and legal domains. To bridge the gap between research and production, we incorporate Matryoshka Representation Learning (MRL), enabling flexible vector sizing that significantly reduces inference and storage costs. Ultimately, the MrBERT family demonstrates that modern encoder architectures can be optimized for both localized linguistic excellence and efficient, high-stakes domain specialization. We open-source the complete model family on Hugging Face.
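The flexible vector sizing mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual way MRL embeddings are consumed: keep the first `dim` components of a full embedding and re-normalize. The function names, the 768 and 128 dimensions, and the float32 storage format are hypothetical choices for illustration, not values taken from the paper.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components of `vec` and L2-normalize the result."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for a dense index, assuming float32 values by default."""
    return num_vectors * dim * bytes_per_value


# Hypothetical sizes: a 768-dimensional embedding truncated to 128 dimensions.
full_dim, small_dim = 768, 128
full = index_size_bytes(1_000_000, full_dim)
small = index_size_bytes(1_000_000, small_dim)
print(f"full index:      {full / 1e9:.2f} GB")
print(f"truncated index: {small / 1e9:.2f} GB ({full // small}x smaller)")
```

Under these assumptions the storage saving is simply the ratio of the two dimensions; how much retrieval or classification quality is retained at each size depends on the model and has to be measured.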
Code & Models
- 🤗 gplsi/Aitana-tourism-mb-encoder-1.0 (model) · 25 downloads
- 🤗 BSC-LT/MrBERT (model) · 1.5k downloads · ♡ 7
- 🤗 BSC-LT/MrBERT-es (model) · 823 downloads · ♡ 4
- 🤗 BSC-LT/MrBERT-biomed (model) · 246 downloads
- 🤗 BSC-LT/MrBERT-legal (model) · 34 downloads
- 🤗 BSC-LT/MrBERT-ca (model) · 21 downloads · ♡ 1
- 🤗 SINAI/ALIA-MrBERT-es-legal-embeddings (model) · 61 downloads
Taxonomy
Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Artificial Intelligence in Healthcare and Education
