Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages
Matej Ulčar, Marko Robnik-Šikonja

TL;DR
This study shows that training dataset size and dictionary size significantly affect BERT model performance for Baltic languages; the new models outperform existing ones across multiple NLP tasks.
Contribution
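The paper introduces two BERT-like models, the trilingual LitLat BERT (Lithuanian, Latvian, and English) and the monolingual Est-RoBERTa (Estonian), and compares them with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian.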
Findings
The new LitLat BERT and Est-RoBERTa models improve on existing models across the four tested tasks in most cases.
Large training datasets improve model performance.
Focusing on a single language benefits downstream NLP tasks.
Abstract
Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual models, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and the importance of a large training set, we compare created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that the newly created LitLat BERT and Est-RoBERTa models improve the results of existing models on all tested tasks in most…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Softmax · WordPiece · Adam · Linear Warmup With Linear Decay
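
The "Linear Warmup With Linear Decay" entry in the methods list names a learning-rate schedule. As a minimal sketch of what such a schedule typically computes: the rate rises linearly from zero to a peak over a warmup phase, then falls linearly back to zero by the end of training. The function name, the peak rate, and the step counts below are illustrative assumptions; this page does not state the values used for LitLat BERT or Est-RoBERTa.

```python
def linear_warmup_linear_decay(step, peak_lr, warmup_steps, total_steps):
    """Return the learning rate for a 0-indexed optimizer step.

    Assumes 0 <= warmup_steps < total_steps. All argument values used
    below are hypothetical and not taken from the paper.
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: fall linearly from peak_lr (at the end of warmup)
    # to 0 at total_steps; clamp so later steps stay at 0.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Illustrative values only: peak rate 1e-4, 100 warmup steps, 1000 total steps.
sample_steps = (0, 50, 100, 550, 1000)
schedule = [linear_warmup_linear_decay(s, 1e-4, 100, 1000) for s in sample_steps]
```

With these illustrative settings the rate is zero at step 0, reaches its peak at step 100, and returns to zero at step 1000.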
