Pretraining and Benchmarking Modern Encoders for Latvian

Arturs Znotins

arXiv:2603.15005 · cs.CL · March 17, 2026

TL;DR

This paper pretrains and benchmarks Latvian-specific encoder models using recent transformer architectures, demonstrating competitive performance and releasing resources to advance Latvian NLP research and applications.

Contribution

The paper introduces Latvian-specific encoder models built on the RoBERTa, DeBERTaV3, and ModernBERT architectures and evaluates them across multiple Latvian benchmarks, addressing the scarcity of monolingual encoders for this low-resource language.

Findings

01 lv-deberta-base outperforms larger multilingual models

02 Models are competitive with existing Latvian encoders

03 Resources are publicly released for further research

Abstract

Encoder-only transformers remain essential for practical NLP tasks. While recent advances in multilingual models have improved cross-lingual capabilities, low-resource languages such as Latvian remain underrepresented in pretraining corpora, and few monolingual Latvian encoders currently exist. We address this gap by pretraining a suite of Latvian-specific encoders based on RoBERTa, DeBERTaV3, and ModernBERT architectures, including long-context variants, and evaluating them across a diverse set of Latvian diagnostic and linguistic benchmarks. Our models are competitive with existing monolingual and multilingual encoders while benefiting from recent architectural and efficiency advances. Our best model, lv-deberta-base (111M parameters), achieves the strongest overall performance, outperforming larger multilingual baselines and prior Latvian-specific encoders. We release all pretrained…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy