Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese

Kurt Micallef; Albert Gatt; Marc Tanti; Lonneke van der Plas; Claudia Borg

arXiv:2205.10517·cs.CL·August 9, 2022

TL;DR

This paper introduces a new Maltese corpus and pre-trained BERT models, showing that pre-training on a mixture of domains is often better than Wikipedia-only text and that even a fraction of the corpus yields significant gains on downstream NLP tasks.

Contribution

The paper presents a new Maltese corpus and two pre-trained BERT models, and analyzes how pre-training data size and domain affect downstream NLP performance for a low-resource language.

Findings

01

Mixture of domains improves performance over Wikipedia-only data

02

Smaller corpora can still achieve significant performance gains

03

Monolingual BERT outperforms or matches mBERT on Maltese tasks

Abstract

Multilingual language models such as mBERT have seen impressive cross-lingual transfer to a variety of languages, but many languages remain excluded from these models. In this paper, we analyse the effect of pre-training with monolingual data for a low-resource language that is not included in mBERT -- Maltese -- with a range of pre-training set ups. We conduct evaluations with the newly pre-trained models on three morphosyntactic tasks -- dependency parsing, part-of-speech tagging, and named-entity recognition -- and one semantic classification task -- sentiment analysis. We also present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance. Our results show that using a mixture of pre-training domains is often superior to using Wikipedia text only. We also find that a fraction of this corpus is enough…
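The comparison of pre-training data size and domain described in the abstract can be pictured with a small sketch. This is not the authors' procedure, which this page does not describe: the function `subsample_by_domain`, the domain labels (`wiki`, `news`, `law`), and the toy documents are all hypothetical, chosen only to show how a fixed fraction of a mixed-domain corpus and a Wikipedia-only subset might be built for such an experiment.

```python
import random
from collections import defaultdict


def subsample_by_domain(docs, fraction, seed=0):
    """Keep roughly `fraction` of the documents in each domain.

    `docs` is a list of (domain, text) pairs. Hypothetical helper,
    not taken from the paper or its repository.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    # Group texts by their domain label.
    by_domain = defaultdict(list)
    for domain, text in docs:
        by_domain[domain].append(text)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sample = []
    for domain in sorted(by_domain):
        texts = by_domain[domain]
        # Keep at least one document per domain so no domain disappears.
        k = max(1, round(len(texts) * fraction))
        for text in rng.sample(texts, k):
            sample.append((domain, text))
    return sample


# Toy corpus with made-up domain labels and placeholder documents.
corpus = (
    [("wiki", f"wiki doc {i}") for i in range(8)]
    + [("news", f"news doc {i}") for i in range(12)]
    + [("law", f"law doc {i}") for i in range(4)]
)

# A quarter of the mixed-domain corpus, preserving the domain proportions.
quarter = subsample_by_domain(corpus, 0.25)

# A Wikipedia-only subset, the kind of baseline the abstract compares against.
wiki_only = [doc for doc in corpus if doc[0] == "wiki"]
```

Sampling within each domain, rather than over the pooled corpus, keeps the domain mixture constant while the size changes, so a size comparison is not confounded by a shift in domain balance. Whether the paper controls for this in the same way is an assumption here.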

Code & Models

Repositories

mlrs/bertu
PyTorch · Official

Taxonomy

Methods

Attention Is All You Need · Linear Layer · Weight Decay · Dense Connections · Linear Warmup With Linear Decay · Dropout · Attention Dropout · WordPiece · Adam