TL;DR
This paper presents a cost-effective method for pretraining BERT-like models within 24 hours on modest hardware, making large language model pretraining more accessible in resource-constrained settings.
Contribution
It introduces a practical recipe combining software, design, and hyperparameter optimizations for affordable BERT pretraining.
Findings
Models are competitive with BERT-base on GLUE tasks
Pretraining completes in 24 hours on a single low-end deep learning server
Pretraining cost is a fraction of the original BERT pretraining cost
Abstract
While large language models à la BERT are used ubiquitously in NLP, pretraining them is considered a luxury that only a few well-funded industry labs can afford. How can one train such models with a more modest budget? We present a recipe for pretraining a masked language model in 24 hours using a single low-end deep learning server. We demonstrate that through a combination of software optimizations, design choices, and hyperparameter tuning, it is possible to produce models that are competitive with BERT-base on GLUE tasks at a fraction of the original pretraining cost.
Taxonomy
Methods
Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Adam · Dense Connections · Softmax · Linear Warmup With Linear Decay · Attention Dropout
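Of the methods listed, "Linear Warmup With Linear Decay" is the one most directly tied to hyperparameter tuning, so a minimal sketch of such a schedule may help. This is a generic illustration, not the paper's implementation: the function name, its parameters, and the example values (peak rate, step counts) are all assumptions chosen for the example.

```python
def linear_warmup_linear_decay(step, total_steps, warmup_steps, peak_lr):
    """Return the learning rate at a given 0-indexed optimizer step.

    Hypothetical helper for illustration only. The rate rises linearly
    from 0 to peak_lr over warmup_steps, then falls linearly back to 0
    at total_steps.
    """
    if total_steps <= 0 or not 0 <= warmup_steps <= total_steps:
        raise ValueError("require 0 <= warmup_steps <= total_steps and total_steps > 0")
    if step < 0:
        raise ValueError("step must be non-negative")

    # Warmup phase: scale the peak rate by the fraction of warmup completed.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps

    # Decay phase: scale the peak rate by the fraction of steps remaining.
    decay_steps = total_steps - warmup_steps
    if decay_steps == 0:
        return peak_lr
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / decay_steps


if __name__ == "__main__":
    # Assumed example values: 100 total steps, 10 warmup steps, peak 1e-3.
    for s in (0, 5, 10, 55, 100):
        print(s, linear_warmup_linear_decay(s, 100, 10, 1e-3))
```

The schedule is written as a pure function of the step count so it can be plugged into any training loop or wrapped by a framework scheduler; when the training budget is fixed in advance, total_steps is known up front and the rate reaches zero exactly at the end of the run.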
