Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

Avinash Maurya; Jie Ye; M. Mustafa Rafique; Franck Cappello; Bogdan Nicolae

arXiv:2410.21316·cs.LG·April 14, 2026

TL;DR

This paper introduces Deep Optimizer States, a dynamic memory management technique that speeds up the training of large transformer models by moving optimizer state between host (CPU) and GPU memory as GPU memory utilization fluctuates.

Contribution

The paper proposes splitting the optimizer state into parts and scheduling where each part lives and is updated, on the CPU or on the GPU, according to fluctuations in memory utilization, with the goal of shortening training iterations.
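As a rough illustration of what splitting and scheduling optimizer state across devices can look like, the sketch below assigns optimizer-state subgroups to the CPU or the GPU in a fixed interleaved pattern. This is a simplification under stated assumptions, not the paper's scheduler: the function name `interleave_subgroups`, the `gpu_stride` parameter, and the fixed stride are all hypothetical, whereas the contribution describes placement driven by memory utilization fluctuations.

```python
def interleave_subgroups(num_subgroups: int, gpu_stride: int) -> list[str]:
    """Assign each optimizer-state subgroup to "cpu" or "gpu".

    Hypothetical policy: every `gpu_stride`-th subgroup is updated on the
    GPU and the rest stay on the CPU, so updates on both devices can in
    principle proceed concurrently.
    """
    if num_subgroups < 0:
        raise ValueError("num_subgroups must be non-negative")
    if gpu_stride < 1:
        raise ValueError("gpu_stride must be at least 1")
    return [
        "gpu" if (index + 1) % gpu_stride == 0 else "cpu"
        for index in range(num_subgroups)
    ]


if __name__ == "__main__":
    # Six subgroups, with every third one placed on the GPU.
    placement = interleave_subgroups(num_subgroups=6, gpu_stride=3)
    print(placement)  # ['cpu', 'cpu', 'gpu', 'cpu', 'cpu', 'gpu']
```

A dynamic scheme would presumably recompute the placement as free GPU memory changes during the forward pass, backward pass, and update phase; the fixed stride here only shows the interleaving idea.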

Findings

1. Achieves 2.5× faster training iterations compared to existing methods.

2. Effectively manages host-GPU memory to reduce training costs for large models.

3. Demonstrates scalability improvements in transformer training.

Abstract

Transformers and large language models (LLMs) have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is very expensive and often hits a "memory wall", i.e., even when using 3D parallelism (pipeline, tensor, data) and aggregating the memory of many GPUs, it is still not enough to hold the necessary data structures (model parameters, optimizer state, gradients, activations) in GPU memory. To compensate, state-of-the-art approaches offload the optimizer state, at least partially, to the host memory and perform hybrid CPU-GPU computations. However, the management of the combined host-GPU memory is often suboptimal and results in poor overlapping between data movements and computations. This leads to missed opportunities to simultaneously leverage the interconnect…
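To make the memory wall concrete, the following sketch estimates the model-state footprint for a given parameter count. It assumes mixed-precision training with Adam and the commonly cited accounting of 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer state (master weights, momentum, variance). These byte counts and the function name `model_state_bytes` are assumptions for illustration, not figures taken from the abstract, and activations are excluded.

```python
def model_state_bytes(num_params: int) -> dict[str, int]:
    """Estimate model-state memory in bytes, excluding activations.

    Assumed accounting (mixed-precision Adam):
      - fp16 parameters: 2 bytes per parameter
      - fp16 gradients:  2 bytes per parameter
      - fp32 optimizer state (master weights, momentum, variance):
        4 + 4 + 4 = 12 bytes per parameter
    """
    if num_params < 0:
        raise ValueError("num_params must be non-negative")
    sizes = {
        "parameters_fp16": 2 * num_params,
        "gradients_fp16": 2 * num_params,
        "optimizer_state_fp32": 12 * num_params,
    }
    sizes["total"] = sum(sizes.values())
    return sizes


if __name__ == "__main__":
    # Hypothetical 20-billion-parameter model.
    for name, size in model_state_bytes(20_000_000_000).items():
        print(f"{name}: {size / 2**30:.1f} GiB")
```

Under these assumptions the optimizer state makes up three quarters of the model state, which is consistent with it being the structure that offloading approaches move to host memory first.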

Code & Models

Repositories

datastates/artifacts (GitHub)
