How to Train BERT with an Academic Budget

Peter Izsak; Moshe Berchansky; Omer Levy

arXiv:2104.07705 · cs.CL · September 10, 2021

TL;DR

This paper presents a cost-effective recipe for pretraining BERT-like masked language models in 24 hours on a single low-end deep learning server, making pretraining accessible to groups with limited compute budgets.

Contribution

It introduces a practical recipe that combines software optimizations, design choices, and hyperparameter tuning for affordable BERT pretraining.

Findings

01 Models are competitive with BERT-base on GLUE tasks

02 Pretraining completes in 24 hours on a single low-end deep learning server

03 Pretraining cost is a fraction of the original BERT pretraining cost

Abstract

While large language models à la BERT are used ubiquitously in NLP, pretraining them is considered a luxury that only a few well-funded industry labs can afford. How can one train such models with a more modest budget? We present a recipe for pretraining a masked language model in 24 hours using a single low-end deep learning server. We demonstrate that through a combination of software optimizations, design choices, and hyperparameter tuning, it is possible to produce models that are competitive with BERT-base on GLUE tasks at a fraction of the original pretraining cost.

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Adam · Dense Connections · Softmax · Linear Warmup With Linear Decay · Attention Dropout