RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use

Pieter Delobelle; Thomas Winters; Bettina Berendt

arXiv:2211.08192 · cs.CL · November 16, 2022 · 5 cites

TL;DR

This paper updates the Dutch RobBERT language model by incorporating recent vocabulary and further pre-training, demonstrating improved performance on language tasks that benefit from current language use.

Contribution

The paper introduces a method for updating a pre-trained language model with new vocabulary and additional training to better reflect evolving language.

Findings

01 Significant performance improvements on certain language tasks

02 Updated tokenizer with new high-frequency tokens

03 Demonstrates benefits of continual language model updating

Abstract

Large transformer-based language models, e.g., BERT and GPT-3, outperform previous architectures on most natural language processing tasks. Such language models are first pre-trained on gigantic corpora of text and later used as base models for fine-tuning on a particular task. Since the pre-training step is usually not repeated, base models are not up to date with the latest information. In this paper, we update RobBERT, a RoBERTa-based state-of-the-art Dutch language model, which was trained in 2019. First, the tokenizer of RobBERT is updated to include new high-frequency tokens present in the latest Dutch OSCAR corpus, e.g., corona-related words. Then we further pre-train the RobBERT model using this dataset. To evaluate whether our new model is a plug-in replacement for RobBERT, we introduce two additional criteria based on concept drift of existing tokens and alignment for novel tokens. We…
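
The tokenizer update described in the abstract can be illustrated with a small sketch. This is not the authors' procedure: RobBERT uses byte pair encoding, whereas the toy below works on whole words, and the helper names (`find_new_tokens`, `extend_vocab`), the thresholds, and the sample sentence are all hypothetical. It only shows the general idea of finding frequent tokens that an older vocabulary lacks and appending them without disturbing existing token ids.

```python
from collections import Counter


def find_new_tokens(old_vocab, corpus_tokens, min_count=2, max_new=3):
    """Return frequent corpus tokens that are missing from the old vocabulary.

    Hypothetical helper: min_count and max_new are illustrative thresholds.
    """
    counts = Counter(tok for tok in corpus_tokens if tok not in old_vocab)
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, n in ranked if n >= min_count][:max_new]


def extend_vocab(old_vocab, new_tokens):
    """Append new tokens after the existing ids so old ids stay unchanged.

    Assumes the old ids are contiguous, running from 0 to len(old_vocab) - 1.
    """
    vocab = dict(old_vocab)
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


# Toy data: a tiny "2019" vocabulary and a newer whitespace-tokenized corpus.
old_vocab = {"de": 0, "het": 1, "virus": 2}
corpus = (
    "de coronamaatregelen het virus coronamaatregelen "
    "lockdown lockdown lockdown avondklok"
).split()

new_tokens = find_new_tokens(old_vocab, corpus)
vocab = extend_vocab(old_vocab, new_tokens)
print(new_tokens)
print(vocab)
```

Keeping the original ids fixed is what lets an existing embedding matrix be reused: only rows for the appended tokens need to be initialized before further pre-training.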

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Data Stream Mining Techniques

Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Softmax · Adam · Weight Decay · Residual Connection · Byte Pair Encoding · Dropout