TL;DR
This paper introduces a new Maltese corpus and pre-trained BERT models, and shows that pre-training on a mixture of domains, even with only a fraction of the corpus, can significantly improve NLP task performance for a low-resource language.
Contribution
It presents a new Maltese corpus and two pre-trained BERT models, and analyzes how pre-training data size and domain affect downstream NLP performance for a low-resource language.
Findings
A mixture of pre-training domains is often superior to Wikipedia-only data
A fraction of the corpus can still yield significant performance gains
Monolingual BERT outperforms or matches mBERT on Maltese tasks
Abstract
Multilingual language models such as mBERT have seen impressive cross-lingual transfer to a variety of languages, but many languages remain excluded from these models. In this paper, we analyse the effect of pre-training with monolingual data for a low-resource language that is not included in mBERT -- Maltese -- with a range of pre-training set ups. We conduct evaluations with the newly pre-trained models on three morphosyntactic tasks -- dependency parsing, part-of-speech tagging, and named-entity recognition -- and one semantic classification task -- sentiment analysis. We also present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance. Our results show that using a mixture of pre-training domains is often superior to using Wikipedia text only. We also find that a fraction of this corpus is enough…
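The size and domain comparisons described in the abstract depend on building pre-training subsets that differ in how much text they contain and which domains they draw from. The following is a minimal sketch of one way to do that, assuming a corpus stored as (domain, text) pairs. The domain names, the toy sentences, and the helper `subsample_by_domain` are hypothetical illustrations, not the authors' actual procedure or data.

```python
import random
from collections import defaultdict


def subsample_by_domain(docs, fraction, seed=0):
    """Keep roughly `fraction` of the documents from every domain.

    `docs` is a list of (domain, text) pairs. Sampling is done per domain
    so that the subset preserves the original domain mixture.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)

    # Group texts by their domain label.
    by_domain = defaultdict(list)
    for domain, text in docs:
        by_domain[domain].append(text)

    # Draw the same fraction from each domain (at least one document).
    sample = []
    for domain in sorted(by_domain):
        texts = by_domain[domain]
        k = max(1, round(len(texts) * fraction))
        sample.extend((domain, t) for t in rng.sample(texts, k))
    return sample


# Hypothetical toy corpus: domain names and sizes are made up.
corpus = (
    [("wiki", f"wiki sentence {i}") for i in range(40)]
    + [("news", f"news sentence {i}") for i in range(60)]
    + [("law", f"law sentence {i}") for i in range(20)]
)

# A mixed-domain subset at a quarter of the full size.
mixed_quarter = subsample_by_domain(corpus, 0.25)

# A Wikipedia-only baseline for comparison.
wiki_only = [doc for doc in corpus if doc[0] == "wiki"]
```

Each subset would then be used to pre-train a separate model, and the models compared on the same downstream tasks. Sampling per domain keeps the domain mixture fixed while the size varies, so the two factors can be studied separately.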
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Weight Decay · Dense Connections · Linear Warmup With Linear Decay · Dropout · Attention Dropout · WordPiece · Adam
