SSDTrain: An Activation Offloading Framework to SSDs for Faster Large Language Model Training

Kun Wu; Jeongmin Brian Park; Xiaofan Zhang; Mert Hidayetoğlu; Vikram Sharma Mailthody; Sitao Huang; Steven Sam Lumetta; Wen-mei Hwu

arXiv:2408.10013·cs.DC·February 18, 2025

Open Access

TL;DR

SSDTrain is an adaptive framework that offloads activations to SSDs during large language model training, significantly reducing GPU memory usage while maintaining training speed and compatibility with popular frameworks.

Contribution

It introduces a novel activation offloading method to SSDs that overlaps data transfer with computation, enabling larger models and micro-batches without performance loss.

Findings

01. Reduces activation peak memory by 47%.

02. Maintains negligible overhead during training.

03. Enables larger micro-batches and improved throughput.

Abstract

The growth rate of the GPU memory capacity has not been able to keep up with that of the size of large language models (LLMs), hindering the model training process. In particular, activations -- the intermediate tensors produced during forward propagation and reused in backward propagation -- dominate the GPU memory use. This leads to high training overhead such as high weight update cost due to the small micro-batch size. To address this challenge, we propose SSDTrain, an adaptive activation offloading framework to high-capacity NVMe SSDs. SSDTrain reduces GPU memory usage without impacting performance by fully overlapping data transfers with computation. SSDTrain is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication and forwarding to further enhance efficiency. We extensively experimented with…
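
The overlap of data transfer with computation described in the abstract can be illustrated with a toy sketch. This is not SSDTrain's implementation or API: `ActivationOffloader` and `train_step` are hypothetical names, Python floats stand in for GPU tensors, a directory on disk stands in for an NVMe SSD, and a thread pool stands in for asynchronous I/O. It only shows the scheduling idea: writes start in the background during the forward pass, and reads are prefetched one layer ahead during the backward pass.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor


class ActivationOffloader:
    """Toy activation offloader (hypothetical; not SSDTrain's API)."""

    def __init__(self, directory):
        self.directory = directory                      # stands in for the SSD
        self.pool = ThreadPoolExecutor(max_workers=2)   # background I/O threads
        self.writes = {}                                # key -> Future of the write
        self.reads = {}                                 # key -> Future of a prefetched read

    def _path(self, key):
        return os.path.join(self.directory, f"act_{key}.bin")

    def _write(self, key, activation):
        with open(self._path(key), "wb") as f:
            pickle.dump(activation, f)

    def _read(self, key):
        self.writes[key].result()  # wait until the write has completed
        with open(self._path(key), "rb") as f:
            return pickle.load(f)

    def save(self, key, activation):
        # Start the write in the background and return immediately,
        # so the caller can continue with the next layer's computation.
        self.writes[key] = self.pool.submit(self._write, key, activation)

    def prefetch(self, key):
        # Start reading an activation before it is needed.
        if key not in self.reads:
            self.reads[key] = self.pool.submit(self._read, key)

    def load(self, key):
        # Block only if the prefetched read has not finished yet.
        self.prefetch(key)
        return self.reads.pop(key).result()

    def close(self):
        self.pool.shutdown(wait=True)


def train_step(weights, x, offloader):
    """Forward and backward pass of the scalar chain y = w_n * ... * w_1 * x."""
    # Forward: offload each layer's input, then compute the layer output.
    a = x
    for i, w in enumerate(weights):
        offloader.save(i, a)
        a = w * a
    y = a

    # Backward: prefetch the previous layer's input while handling this layer.
    grads = [0.0] * len(weights)
    grad_out = 1.0
    for i in reversed(range(len(weights))):
        if i > 0:
            offloader.prefetch(i - 1)
        a_in = offloader.load(i)
        grads[i] = grad_out * a_in        # dy/dw_i
        grad_out = grad_out * weights[i]  # propagate to the layer below
    return y, grads


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        offloader = ActivationOffloader(d)
        print(train_step([2.0, 3.0, 4.0], 1.5, offloader))
        offloader.close()
```

In a real system the saved objects would be GPU tensors and the transfers would go through a dedicated GPU-to-SSD path; the sketch keeps only the ordering of save, prefetch, and load calls relative to the compute steps.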

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Gated Linear Unit · Cosine Annealing · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · WordPiece · Linear Warmup With Linear Decay · Weight Decay · Adam · Byte Pair Encoding