Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers
Avinash Maurya, Jie Ye, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

TL;DR
This paper analyzes the memory and I/O patterns in hybrid CPU-GPU training of large transformers, revealing bottlenecks and opportunities for optimizing cost and performance in offloaded optimizer strategies.
Contribution
It provides a detailed characterization of GPU memory utilization and data transfer behaviors during offloaded training, addressing a gap in understanding of these complex interactions.
Findings
GPU memory utilization varies significantly during training iterations.
Data transfers between host and GPU are a major bottleneck.
Opportunities exist to optimize overlapping of computation and data movement.
Abstract
Transformers and LLMs have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is slow and often takes on the order of weeks or months. Thanks to 3D model parallelism (data, pipeline, and tensor-level parallelism), the training can scale to a large number of GPUs, which reduces the duration of the training but dramatically increases the cost. Even when a large number of GPUs are available, the aggregated GPU memory is often not enough to hold the full training state (optimizer state, model parameters, and gradients). To compensate, state-of-the-art approaches offload the optimizer state at least partially to the host memory and perform hybrid CPU-GPU computations. Such flexible solutions dramatically reduce the GPU memory utilization, which makes it feasible to…
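To make the memory pressure described in the abstract concrete, the following is a rough back-of-the-envelope sketch, not the paper's own model. It assumes mixed-precision training with an Adam-style optimizer, where each parameter costs roughly 2 bytes for the half-precision weight, 2 bytes for the gradient, and 12 bytes of optimizer state (a full-precision weight copy plus two moment estimates). The function names, the byte counts, the 20-billion-parameter size, and the `offload_fraction` parameter are all illustrative assumptions; activations and temporary buffers are ignored.

```python
GIB = 1024 ** 3


def training_state_bytes(num_params, param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Per-component size of the training state, in bytes.

    The default byte counts are an assumption (mixed precision + Adam),
    not figures taken from the paper.
    """
    return {
        "parameters": num_params * param_bytes,
        "gradients": num_params * grad_bytes,
        "optimizer_state": num_params * optim_bytes,
    }


def gpu_resident_bytes(state, offload_fraction):
    """Bytes that stay on the GPU when a fraction of the optimizer state
    is moved to host memory (hypothetical knob for illustration)."""
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    kept_optimizer = state["optimizer_state"] * (1.0 - offload_fraction)
    return state["parameters"] + state["gradients"] + kept_optimizer


if __name__ == "__main__":
    # Hypothetical 20-billion-parameter model.
    state = training_state_bytes(20_000_000_000)
    for fraction in (0.0, 0.5, 1.0):
        resident = gpu_resident_bytes(state, fraction)
        print(f"offload {fraction:.0%}: {resident / GIB:.1f} GiB on GPU")
```

Under these assumptions the optimizer state accounts for three quarters of the training state, which is why offloading it, even partially, frees so much GPU memory. The price is the extra host-GPU traffic that the findings identify as a bottleneck.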
