RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use
Pieter Delobelle, Thomas Winters, Bettina Berendt

TL;DR
This paper updates the Dutch RobBERT language model by incorporating recent vocabulary and further pre-training, demonstrating improved performance on language tasks that benefit from current language use.
Contribution
The paper introduces a method for updating a pre-trained language model with new vocabulary and additional training to better reflect evolving language.
Findings
Significant performance improvements on certain language tasks
Updated tokenizer with new high-frequency tokens
Demonstrates benefits of continual language model updating
Abstract
Large transformer-based language models, e.g. BERT and GPT-3, outperform previous architectures on most natural language processing tasks. Such language models are first pre-trained on gigantic corpora of text and later used as base models for finetuning on a particular task. Since the pre-training step is usually not repeated, base models are not up-to-date with the latest information. In this paper, we update RobBERT, a RoBERTa-based state-of-the-art Dutch language model, which was trained in 2019. First, the tokenizer of RobBERT is updated to include new high-frequency tokens present in the latest Dutch OSCAR corpus, e.g. corona-related words. Then we further pre-train the RobBERT model using this dataset. To evaluate if our new model is a plug-in replacement for RobBERT, we introduce two additional criteria based on concept drift of existing tokens and alignment for novel tokens. We…
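
The tokenizer update described in the abstract can be illustrated with a deliberately simplified sketch. It works on whitespace-separated words rather than the byte-pair-encoding subwords a RoBERTa-style tokenizer actually uses, and it is not the authors' procedure. The function names (`find_new_tokens`, `extend_vocab`), the `min_count` threshold, and the toy corpus are all assumptions made for illustration. The one property it tries to show is that new tokens are appended with fresh ids while existing ids stay unchanged, which is what a drop-in replacement would plausibly need.

```python
from collections import Counter


def find_new_tokens(corpus_lines, old_vocab, min_count=2):
    """Return words that are frequent in the new corpus but absent from old_vocab.

    Hypothetical helper: a word-level stand-in for selecting new
    high-frequency tokens; a real tokenizer would operate on subwords.
    """
    counts = Counter(
        word.lower()
        for line in corpus_lines
        for word in line.split()
    )
    candidates = [
        (word, count)
        for word, count in counts.items()
        if count >= min_count and word not in old_vocab
    ]
    # Most frequent first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda wc: (-wc[1], wc[0]))
    return [word for word, _ in candidates]


def extend_vocab(old_vocab, new_tokens):
    """Append new tokens with fresh ids, leaving existing ids untouched."""
    vocab = dict(old_vocab)
    next_id = max(vocab.values(), default=-1) + 1
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = next_id
            next_id += 1
    return vocab


# Toy data (assumed): an "old" vocabulary and a few newer sentences.
old_vocab = {"de": 0, "het": 1, "virus": 2}
corpus = [
    "de coronamaatregelen het virus",
    "coronamaatregelen de lockdown",
    "lockdown het virus",
]

new_tokens = find_new_tokens(corpus, old_vocab)
updated = extend_vocab(old_vocab, new_tokens)
print(new_tokens)
print(updated)
```

In a real model, each appended token id would also need a new row in the embedding matrix before further pre-training; that step is outside this sketch.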
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Stream Mining Techniques
Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Softmax · Adam · Weight Decay · Residual Connection · Byte Pair Encoding · Dropout
