Cramming: Training a Language Model on a Single GPU in One Day

Jonas Geiping; Tom Goldstein

arXiv:2212.14034 · cs.CL · December 29, 2022 · 30 cites

TL;DR

This paper explores how well a transformer-based language model can perform when trained from scratch on a single GPU within one day. It analyzes the challenges of scaling down, the modifications that help, and how performance relates to scaling laws in this constrained setting.

Contribution

The paper shows that performance close to BERT can be achieved with limited compute by re-analyzing and adapting the components of the pretraining pipeline for a single-GPU, one-day training budget.

Findings

01 Performance closely follows large-scale scaling laws

02 Modified training pipeline achieves results near BERT

03 Insights into effective modifications for limited compute training

Abstract

Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this…
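The masked language modeling objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes the generic BERT-style corruption recipe (select about 15% of tokens, then replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged). These rates, the function name `mask_tokens`, and the toy vocabulary are illustrative assumptions, not the configuration used in the paper.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder mask symbol (assumption for this sketch)


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted, labels) for BERT-style masked language modeling.

    labels[i] holds the original token at selected positions, None elsewhere.
    The model is trained to predict labels[i] wherever it is not None.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # this position contributes to the loss
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: unchanged
        else:
            labels.append(None)  # position is ignored by the loss
            corrupted.append(tok)
    return corrupted, labels


if __name__ == "__main__":
    toy_vocab = ["the", "cat", "sat", "on", "mat", "dog"]
    sentence = "the cat sat on the mat".split()
    print(mask_tokens(sentence, toy_vocab, seed=0))
```

Only the selected positions carry a training signal, so the masking rate directly controls how much supervision each sequence provides. That makes it one of the pipeline components worth revisiting under a fixed compute budget.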

Code & Models

Repositories

jonasgeiping/cramming (PyTorch, official)

Videos

Cramming: Training a Language Model on a Single GPU in One Day · SlidesLive

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Weight Decay · WordPiece · Dense Connections · Linear Warmup With Linear Decay · Layer Normalization
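
One of the listed methods, linear warmup with linear decay, can be sketched as a simple learning-rate function. This is a generic illustration: the function name, parameter names, and example values are hypothetical, and the listing does not state which schedule settings the paper actually uses.

```python
def linear_warmup_linear_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step` for a linear-warmup, linear-decay schedule.

    The rate rises linearly from 0 to peak_lr over warmup_steps, then
    falls linearly back to 0 at total_steps.
    """
    if not 0 <= warmup_steps < total_steps:
        raise ValueError("require 0 <= warmup_steps < total_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / (total_steps - warmup_steps))


if __name__ == "__main__":
    # Illustrative values only (assumptions, not the paper's settings).
    for s in (0, 5, 10, 55, 100):
        print(s, linear_warmup_linear_decay(s, total_steps=100, warmup_steps=10, peak_lr=1e-3))
```

Under a fixed time budget, the total number of steps is known up front, so a schedule that is parameterized by `total_steps` can be made to reach zero exactly when training ends.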