TL;DR
This paper introduces Deep Optimizer States, a dynamic memory management technique that speeds up the training of large transformer models by moving optimizer states between CPU and GPU memory as GPU memory utilization changes.
Contribution
It proposes a novel method to split and schedule optimizer states across CPU and GPU based on memory utilization fluctuations, enhancing training speed.
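To make the idea concrete, here is a toy sketch of memory-aware placement. It is not the paper's scheduling algorithm: the Subgroup type, the plan_updates function, and the greedy "fits in free GPU memory" rule are all hypothetical, chosen only to illustrate how a per-subgroup decision could follow fluctuating GPU memory availability.

```python
from dataclasses import dataclass


@dataclass
class Subgroup:
    """A slice of the optimizer state (hypothetical unit of scheduling)."""
    name: str
    state_bytes: int  # memory needed to hold this slice's optimizer state


def plan_updates(subgroups, free_gpu_bytes_per_step):
    """Assign each subgroup's optimizer update to 'gpu' or 'cpu'.

    free_gpu_bytes_per_step[i] is the assumed amount of GPU memory free at
    the moment subgroup i is due to be updated. Toy greedy rule: update on
    the GPU if the state fits at that moment, otherwise update on the CPU.
    """
    plan = []
    for sg, free in zip(subgroups, free_gpu_bytes_per_step):
        device = "gpu" if sg.state_bytes <= free else "cpu"
        plan.append((sg.name, device))
    return plan


if __name__ == "__main__":
    groups = [Subgroup("sg0", 4), Subgroup("sg1", 4), Subgroup("sg2", 4)]
    # Free GPU memory rises and falls over the course of an iteration.
    print(plan_updates(groups, [6, 2, 5]))
```

A real scheduler would also have to account for transfer time over the host-GPU interconnect and for overlap with computation, which this sketch ignores.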
Findings
Achieves 2.5× faster training iterations than state-of-the-art approaches that offload the optimizer state.
Effectively manages host-GPU memory to reduce training costs for large models.
Demonstrates scalability improvements in transformer training.
Abstract
Transformers and large language models (LLMs) have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is very expensive and often hits a "memory wall", i.e., even when using 3D parallelism (pipeline, tensor, data) and aggregating the memory of many GPUs, it is still not enough to hold the necessary data structures (model parameters, optimizer state, gradients, activations) in GPU memory. To compensate, state-of-the-art approaches offload the optimizer state, at least partially, to the host memory and perform hybrid CPU-GPU computations. However, the management of the combined host-GPU memory is often suboptimal and results in poor overlapping between data movements and computations. This leads to missed opportunities to simultaneously leverage the interconnect…
