Cost-Efficient LLM Training with Lifetime-Aware Tensor Offloading via GPUDirect Storage

Ziqi Yuan; Haoyang Zhang; Yirui Eric Zhou; Apoorve Mohan; I-Hsin Chung; Seetharami Seelam; Jian Huang

arXiv:2506.06472·cs.DC·June 10, 2025

TL;DR

This paper introduces TERAIO, a lifetime-aware tensor offloading framework for large language model (LLM) training that expands GPU memory by offloading inactive tensors to SSDs, improving training performance by 1.47x on average.

Contribution

The paper presents TERAIO, a lifetime-aware tensor offloading framework that plans tensor migration from GPU memory profiling and moves tensors via GPUDirect Storage to improve LLM training performance.

Findings

1. TERAIO improves LLM training performance by 1.47x on average.

2. It achieves 80.7% of the ideal performance that would be attainable with unlimited GPU memory.

3. The framework offloads inactive tensors to SSDs without stalling training.

Abstract

We present the design and implementation of a new lifetime-aware tensor offloading framework for GPU memory expansion using low-cost PCIe-based solid-state drives (SSDs). Our framework, TERAIO, is developed explicitly for large language model (LLM) training with multiple GPUs and multiple SSDs. Its design is driven by our observation that the active tensors take only a small fraction (1.7% on average) of allocated GPU memory in each LLM training iteration, while the inactive tensors are usually large and will not be used for a long period of time, creating ample opportunities for offloading/prefetching tensors to/from slow SSDs without stalling the GPU training process. TERAIO accurately estimates the lifetime (active period of time in GPU memory) of each tensor by profiling the first few iterations of the training process. With the tensor lifetime analysis, TERAIO will generate an…
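
The lifetime idea in the abstract can be illustrated with a small sketch. This is not TERAIO's planner, which the truncated abstract does not describe. It assumes a hypothetical profiled trace of tensor accesses and a single hypothetical SSD bandwidth figure, and it flags a tensor as an offload candidate only when the idle gap between two consecutive uses is longer than the time needed to write the tensor out and read it back. All names (`Access`, `OffloadCandidate`, `find_offload_candidates`) and all numbers are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    tensor: str        # hypothetical tensor name
    size_bytes: int    # tensor size observed during profiling
    time_ms: float     # profiled time at which a kernel touches the tensor


@dataclass(frozen=True)
class OffloadCandidate:
    tensor: str
    idle_start_ms: float   # last use before the idle gap
    idle_end_ms: float     # next use after the idle gap
    round_trip_ms: float   # estimated offload + prefetch time


def find_offload_candidates(trace, bandwidth_gb_per_s):
    """Return idle gaps long enough to offload a tensor and prefetch it back.

    Assumption: one symmetric bandwidth value covers both the write
    (offload) and the read (prefetch); real devices differ.
    """
    bytes_per_ms = bandwidth_gb_per_s * 1e9 / 1e3
    last_seen = {}
    candidates = []
    for access in sorted(trace, key=lambda a: a.time_ms):
        prev = last_seen.get(access.tensor)
        if prev is not None:
            gap_ms = access.time_ms - prev.time_ms
            round_trip_ms = 2 * access.size_bytes / bytes_per_ms
            # Only offload when the transfer can hide inside the idle gap.
            if gap_ms > round_trip_ms:
                candidates.append(
                    OffloadCandidate(
                        access.tensor, prev.time_ms, access.time_ms, round_trip_ms
                    )
                )
        last_seen[access.tensor] = access
    return candidates


# Hypothetical trace: a large activation reused after a long gap, and a
# smaller weight tensor reused almost immediately.
trace = [
    Access("activation_0", 4_000_000_000, 0.0),
    Access("weight_0", 1_000_000_000, 100.0),
    Access("weight_0", 1_000_000_000, 200.0),
    Access("activation_0", 4_000_000_000, 2000.0),
]
candidates = find_offload_candidates(trace, bandwidth_gb_per_s=6.0)
for c in candidates:
    print(c.tensor, c.idle_start_ms, c.idle_end_ms, round(c.round_trip_ms, 1))
```

With these made-up numbers only `activation_0` qualifies: its 2000 ms idle gap exceeds the estimated round trip, while `weight_0` is reused too soon for the transfer to be hidden. A real planner would also have to account for contention among concurrent transfers and for several GPUs and SSDs, which this sketch ignores.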

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Cloud Computing and Resource Management

Methods: 1x1 Convolution · Non Maximum Suppression · Self-Supervised Deep Supervision · ZeRO-Infinity · Convolution · ZeRO-Offload · SSD