Training Large Neural Networks with Constant Memory using a New Execution Algorithm

Bharadwaj Pudipeddi, Maral Mesmakhosroshahi, Jinwen Xi, and Sujeeth Bharadwaj

arXiv:2002.05645·cs.LG·June 8, 2020·24 cites

Open Access · 2 Repos

TL;DR

This paper introduces L2L (layer-to-layer), a relay-style execution algorithm that keeps the model in host memory and streams layers to the device one at a time, so that device memory holds mainly the executing layer's footprint. The paper reports lower memory usage and higher throughput than its baseline.

Contribution

The paper presents a layer-to-layer execution method that trains models with billions of parameters on a single memory-limited device without model partitioning, and it introduces a constant-memory variant of L2L.

Findings

01 · 45% reduction in memory usage compared to the baseline

02 · 40% increase in throughput over the state-of-the-art baseline

03 · Fits models of up to 50 billion parameters on a single GPU with 16GB of memory
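To see why the third finding depends on keeping the model off the device, the weight footprint can be estimated with simple arithmetic. This is a rough sketch, not a figure from the paper: the bytes-per-parameter values (4 for fp32, 2 for fp16) and the reading of "16GB" as 16 GiB are assumptions, and `param_bytes` is a hypothetical helper. Gradients, optimizer state, and activations are ignored, so the real requirement would be larger.

```python
GIB = 1024 ** 3


def param_bytes(num_params: int, bytes_per_param: int) -> int:
    """Bytes needed for the weights alone (no gradients, optimizer state, or activations)."""
    return num_params * bytes_per_param


model_params = 50_000_000_000  # 50 billion parameters, from the finding above
device_bytes = 16 * GIB        # "16GB" device, read as 16 GiB (assumption)

weights_fp32 = param_bytes(model_params, 4)  # assumes 4 bytes per fp32 parameter
weights_fp16 = param_bytes(model_params, 2)  # assumes 2 bytes per fp16 parameter

fits_fp32 = weights_fp32 <= device_bytes
fits_fp16 = weights_fp16 <= device_bytes

print(f"fp32 weights: {weights_fp32 / GIB:.1f} GiB, fits on device: {fits_fp32}")
print(f"fp16 weights: {weights_fp16 / GIB:.1f} GiB, fits on device: {fits_fp16}")
```

Under these assumptions the weights alone exceed device memory several times over in either precision, which is consistent with the model living in host memory and only the executing layer being resident on the device.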

Abstract

Widely popular transformer-based NLP models such as BERT and Turing-NLG have enormous capacity trending to billions of parameters. Current execution methods demand brute-force resources such as HBM devices and high speed interconnectivity for data parallelism. In this paper, we introduce a new relay-style execution technique called L2L (layer-to-layer) where at any given moment, the device memory is primarily populated only with the executing layer(s)'s footprint. The model resides in the DRAM memory attached to either a CPU or an FPGA as an entity we call eager param-server (EPS). To overcome the bandwidth issues of shuttling parameters to and from EPS, the model is executed a layer at a time across many micro-batches instead of the conventional method of minibatches over whole model. L2L is implemented using 16GB V100 devices for BERT-Large running it with a device batch size of up to…
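The loop inversion described in the abstract (one layer at a time across many micro-batches, rather than each batch through the whole model) can be illustrated with a toy forward pass. This is a minimal sketch under stated assumptions, not the authors' implementation: `ToyEPS`, `conventional_forward`, and `l2l_forward` are hypothetical names, layers are plain Python functions on integers, and the backward pass, optimizer step, and real host-to-device transfers are omitted.

```python
from typing import Callable, List, Tuple

Layer = Callable[[int], int]


class ToyEPS:
    """Hypothetical stand-in for the eager param-server: all layers live in host memory."""

    def __init__(self, layers: List[Layer]) -> None:
        self._layers = layers

    def __len__(self) -> int:
        return len(self._layers)

    def fetch(self, index: int) -> Layer:
        """Simulate copying one layer from host memory to the device."""
        return self._layers[index]


def conventional_forward(layers: List[Layer], micro_batches: List[int]) -> List[int]:
    """Conventional order: every layer is resident; each batch runs through the whole model."""
    outputs = []
    for x in micro_batches:
        for layer in layers:
            x = layer(x)
        outputs.append(x)
    return outputs


def l2l_forward(eps: ToyEPS, micro_batches: List[int]) -> Tuple[List[int], int]:
    """Layer-to-layer order: fetch one layer, run all micro-batches through it, then evict it."""
    activations = list(micro_batches)
    peak_resident = 0
    for index in range(len(eps)):
        resident = [eps.fetch(index)]  # only the executing layer is on the device
        peak_resident = max(peak_resident, len(resident))
        activations = [resident[0](x) for x in activations]
        resident.clear()  # evict before the next layer arrives
    return activations, peak_resident


layers: List[Layer] = [lambda x: x + 1, lambda x: x * 3, lambda x: x - 2]
micro_batches = [1, 2, 3, 4]

conventional_out = conventional_forward(layers, micro_batches)
l2l_out, peak = l2l_forward(ToyEPS(layers), micro_batches)
print(conventional_out, l2l_out, peak)
```

Both orderings produce the same outputs, but the layer-to-layer loop fetches each layer once per pass regardless of how many micro-batches there are, which is how the transfer cost gets amortized. The trade-off visible even in this toy is that the activations of every micro-batch must be kept between layers.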

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications · Advanced Memory and Neural Computing

Methods: Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Cosine Annealing · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Dense Connections · Weight Decay