MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training

Pinxue Zhao; Hailin Zhang; Fangcheng Fu; Xiaonan Nie; Qibin Liu; Fang Yang; Yuanbo Peng; Dian Jiao; Shuaipeng Li; Jinbao Xue; Yangyu Tao; Bin Cui

arXiv:2407.12117 · cs.LG · January 16, 2025 · 2 cites

Open Access

TL;DR

MEMO introduces a fine-grained activation memory management framework that enables efficient training of ultra-long context LLMs by offloading activations to CPU memory and optimizing memory reuse, significantly improving GPU utilization.

Contribution

The paper presents MEMO, a novel framework that manages activation memory at a fine granularity, reducing fragmentation and communication overhead for long context LLM training.

Findings

01 Achieves 1.97x and 1.80x MFU improvements over Megatron-LM and DeepSpeed, respectively.

02 Enables training of a 7B LLM with a sequence length of 1 million on 8 GPUs.

03 Reduces memory fragmentation and recomputation overhead.

Abstract

Nowadays, Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. However, long context training poses great challenges considering the constraint of GPU memory. It not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long context training, existing frameworks have adopted strategies such as recomputation and various forms of parallelism. Nevertheless, these techniques rely on redundant computation or extensive communication, resulting in low Model FLOPS Utilization (MFU). In this paper, we propose MEMO, a novel LLM training framework designed for fine-grained activation memory management. Given the quadratic scaling of computation and linear scaling of memory with sequence lengths when using FlashAttention, we offload memory-consuming…
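The scaling argument in the abstract (quadratic computation, linear memory) can be illustrated with a back-of-the-envelope sketch. The cost model below is a simplifying assumption and is not taken from the paper: it counts only the two attention matmuls of one layer without a causal mask, and a single `[seq_len, hidden]` activation tensor in 16-bit precision. The function names are hypothetical.

```python
def attention_flops(seq_len: int, hidden: int) -> int:
    """Forward FLOPs of the two attention matmuls (QK^T and PV) for one layer.

    Assumed cost model: each matmul costs 2 * seq_len * seq_len * hidden FLOPs
    (one multiply and one add per term); no causal-mask savings are counted.
    """
    return 2 * (2 * seq_len * seq_len * hidden)


def activation_bytes(seq_len: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Bytes of one [seq_len, hidden] activation tensor (16-bit by default)."""
    return seq_len * hidden * bytes_per_elem


def flops_per_byte(seq_len: int, hidden: int) -> float:
    """Attention compute available per byte of activation memory."""
    return attention_flops(seq_len, hidden) / activation_bytes(seq_len, hidden)


if __name__ == "__main__":
    # Hypothetical hidden size; the ratio is independent of it in this model.
    for s in (8_192, 65_536, 1_048_576):
        print(f"seq_len={s:>9}: {flops_per_byte(s, 4096):.0f} FLOPs per byte")
```

Under these assumptions the ratio grows linearly with sequence length, so at longer contexts there is proportionally more computation per byte of activation data, which is the intuition behind moving activations off the GPU while the computation runs.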

Taxonomy

Topics: Advanced Neural Network Applications · Medical Image Segmentation Techniques · Vehicle License Plate Recognition

Methods: Fragmentation