BERnaT: Basque Encoders for Representing Natural Textual Diversity
Ekhi Azurmendi, Joseba Fernandez de Landa, Jaione Bengoetxea, Maite Heredia, Julen Etxaniz, Mikel Zubillaga, Ander Soraluze, Aitor Soroa

TL;DR
This paper emphasizes the importance of linguistic diversity in language models, introduces new Basque corpora and models capturing diverse language varieties, and demonstrates improved performance on varied NLU tasks.
Contribution
It constructs new Basque corpora and trains models on diverse language data, enhancing linguistic generalization and inclusivity in language modeling.
Findings
Models trained on diverse data outperform standard-only models.
Diverse training improves performance across all NLU tasks.
Standard benchmark accuracy remains unaffected.
Abstract
Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on the Basque language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data…
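As a rough illustration of the evaluation idea described in the abstract (scoring NLU tasks separately for a standard and a diverse subset), here is a minimal Python sketch. The task names, the subset assignment, the `subset_averages` helper, and the scores are all hypothetical placeholders for illustration. They are not the paper's benchmarks, code, or results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical registry: task name -> subset label.
# The real task list and its split are defined in the paper, not here.
TASK_SUBSET = {
    "task_a": "standard",
    "task_b": "standard",
    "task_c": "diverse",
    "task_d": "diverse",
}


def subset_averages(scores, task_subset=TASK_SUBSET):
    """Group per-task scores by subset and return the mean of each group."""
    grouped = defaultdict(list)
    for task, score in scores.items():
        grouped[task_subset[task]].append(score)
    return {subset: mean(values) for subset, values in grouped.items()}


# Placeholder scores, invented for this sketch (not reported results).
toy_scores = {"task_a": 0.75, "task_b": 0.5, "task_c": 0.25, "task_d": 0.5}
print(subset_averages(toy_scores))
```

Reporting one average per subset, rather than a single overall number, is what would make a gap between standard and non-standard text visible when comparing models.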
Code & Models
- 🤗 HiTZ/BERnaT-medium · model · 13 downloads · ♡ 1
- 🤗 HiTZ/BERnaT-base · model · 598 downloads · ♡ 1
- 🤗 HiTZ/BERnaT-large · model · 6 downloads · ♡ 1
- 🤗 HiTZ/BERnaT-Standard-large · model · 5 downloads
- 🤗 HiTZ/BERnaT-Standard-medium · model
- 🤗 HiTZ/BERnaT-Standard-base · model · 6 downloads
- 🤗 HiTZ/BERnaT-Diverse-medium · model · 1 download
- 🤗 HiTZ/BERnaT-Diverse-base · model · 5 downloads
- 🤗 HiTZ/BERnaT-Diverse-large · model · 2 downloads
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Text Readability and Simplification
