Benchmarking down-scaled (not so large) pre-trained language models
M. Aßenmacher, P. Schulze, C. Heumann

TL;DR
This paper benchmarks smaller pre-trained Transformer models on GLUE tasks, systematically comparing objectives and hyperparameters, and explores scaling strategies to improve performance efficiently.
Contribution
It provides a systematic comparison of pre-training objectives and hyperparameters on down-scaled models, and investigates effective scaling methods for Transformer-based language models.
Findings
MLM + NSP (BERT-style) consistently outperforms both MLM alone and the standard LM objective.
Increasing model size yields better performance than longer training.
Scaling strategies improve model efficiency and effectiveness.
Abstract
Large Transformer-based language models are pre-trained on corpora of varying sizes, for a different number of steps and with different batch sizes. At the same time, more fundamental components, such as the pre-training objective or architectural hyperparameters, are modified. In total, it is therefore difficult to ascribe changes in performance to specific factors. Since searching the hyperparameter space over the full systems is too costly, we pre-train down-scaled versions of several popular Transformer-based architectures on a common pre-training corpus and benchmark them on a subset of the GLUE tasks (Wang et al., 2018). Specifically, we systematically compare three pre-training objectives for different shape parameters and model sizes, while also varying the number of pre-training steps and the batch size. In our experiments MLM + NSP (BERT-style) consistently outperforms MLM…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
