Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages

Matej Ulčar; Marko Robnik-Šikonja

arXiv:2112.10553 · cs.CL · December 21, 2021

Open Access

TL;DR

This study shows that training dataset size and dictionary size affect BERT model performance for Baltic languages: the newly trained LitLat BERT and Est-RoBERTa models improve on existing models across the tested NLP tasks.

Contribution

The paper introduces a trilingual LitLat BERT model (Lithuanian, Latvian, and English) and a monolingual Est-RoBERTa model (Estonian), and compares them with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian.

Findings

01

New models outperform existing models on all tested tasks.

02

Large training datasets improve model performance.

03

Focusing on a single language benefits downstream NLP tasks.

Abstract

Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual models, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and the importance of a large training set, we compare created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that the newly created LitLat BERT and Est-RoBERTa models improve the results of existing models on all tested tasks in most…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Softmax · WordPiece · Adam · Linear Warmup With Linear Decay