BERnaT: Basque Encoders for Representing Natural Textual Diversity

Ekhi Azurmendi; Joseba Fernandez de Landa; Jaione Bengoetxea; Maite Heredia; Julen Etxaniz; Mikel Zubillaga; Ander Soraluze; Aitor Soroa

arXiv:2512.03903 · cs.CL · March 24, 2026

Open Access · 9 Models · 1 Dataset

TL;DR

This paper emphasizes the importance of linguistic diversity in language models, introduces new Basque corpora and models capturing diverse language varieties, and demonstrates improved performance on varied NLU tasks.

Contribution

The authors construct new Basque corpora that combine standard, social media, and historical sources, and pre-train encoder-only models on them to improve linguistic generalization and inclusivity in language modeling.

Findings

01. Models trained on both standard and diverse data outperform models trained only on standard data.

02. Adding diverse data to training improves performance across all NLU tasks.

03. Standard benchmark accuracy remains unaffected.

Abstract

Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on the Basque language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data…
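The evaluation idea in the abstract, reporting NLU results separately for standard and diverse task subsets, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function name `average_by_subset`, the task names, and the scores are hypothetical placeholders, not the authors' code, benchmarks, or results. It only shows how per-task scores could be grouped by subset and averaged for each training configuration.

```python
from collections import defaultdict
from statistics import mean


def average_by_subset(scores, task_subset):
    """Average per-task scores within each evaluation subset.

    scores:      {model_name: {task_name: score}}
    task_subset: {task_name: "standard" or "diverse"}
    Returns      {model_name: {subset_name: mean_score}}
    """
    result = {}
    for model, per_task in scores.items():
        buckets = defaultdict(list)
        for task, score in per_task.items():
            # Route each task's score to the subset it belongs to.
            buckets[task_subset[task]].append(score)
        result[model] = {subset: mean(vals) for subset, vals in buckets.items()}
    return result


# Hypothetical task-to-subset mapping (placeholder task names).
task_subset = {"task_a": "standard", "task_b": "standard", "task_c": "diverse"}

# Placeholder scores only, NOT results from the paper.
scores = {
    "standard": {"task_a": 0.5, "task_b": 0.75, "task_c": 0.25},
    "combined": {"task_a": 0.5, "task_b": 0.75, "task_c": 0.5},
}

summary = average_by_subset(scores, task_subset)
for model, subsets in summary.items():
    print(model, subsets)
```

Keeping the two subsets apart, rather than reporting a single average, makes it visible when a training configuration changes performance on diverse text while leaving the standard subset unchanged.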

Code & Models

Models

Datasets

HiTZ/BERnaT-Diverse
dataset · 27 downloads

Taxonomy

Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Text Readability and Simplification