Cramming: Training a Language Model on a Single GPU in One Day
Jonas Geiping, Tom Goldstein

TL;DR
This paper explores training a transformer-based language model from scratch on a single GPU within one day, analyzing why scaling down is hard, which pipeline modifications help, and how performance scales in this highly constrained setting.
Contribution
The paper shows that performance close to BERT can be reached with limited compute by re-analyzing nearly all components of the pretraining pipeline and adapting them to a single-GPU, one-day training budget.
Findings
Performance closely follows large-scale scaling laws
Modified training pipeline achieves results near BERT
Insights into effective modifications for limited compute training
Abstract
Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this…
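The masked language modeling objective named in the abstract can be illustrated with a minimal sketch of the input corruption step. It assumes the standard BERT-style recipe (15% of positions selected; of those, 80% replaced by a mask token, 10% by a random token, 10% left unchanged). The masking rate, `MASK_ID`, `VOCAB_SIZE`, and the helper name `mask_tokens` are illustrative assumptions, not settings taken from the paper.

```python
import random

# Hypothetical values for illustration only; not taken from the paper.
MASK_ID = 103        # id of the [MASK] token in an assumed vocabulary
VOCAB_SIZE = 30522   # size of the assumed vocabulary
IGNORE = -100        # label for positions that do not contribute to the loss


def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for one masked language modeling example.

    Each position is selected with probability mask_prob. A selected
    position keeps its original id as the label, and its input is
    replaced by MASK_ID (80%), a random id (10%), or left as-is (10%).
    Unselected positions get the IGNORE label.
    """
    rng = rng if rng is not None else random.Random()
    inputs = list(token_ids)
    labels = [IGNORE] * len(inputs)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)
        # otherwise the original token is kept in the input
    return inputs, labels
```

A model trained with this objective predicts the original id at every position whose label is not `IGNORE`; passing a seeded `random.Random` as `rng` makes the corruption reproducible.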
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Weight Decay · WordPiece · Dense Connections · Linear Warmup With Linear Decay · Layer Normalization
