Optimizing Large Model Training through Overlapped Activation Recomputation

Ping Chen, Wenjie Zhang, Shuibing He, Weijian Chen, Siling Yang, Kexin Huang, Yanlong Yin, Xuan Zhan, Yingjie Gu, Zhuwei Peng, Yi Zheng, Zhefeng Wang, Gang Chen

arXiv:2406.08756 · cs.DC · March 31, 2025 · 1 citation

TL;DR

Lynx is a novel recomputation framework that overlaps recomputation with communication in large model training pipelines, significantly reducing overhead and improving throughput for models with billions of parameters.

Contribution

The paper introduces Lynx, a heuristic-based recomputation scheduling and model partitioning approach that reduces overhead and enhances training efficiency for large neural networks.

Findings

1. Lynx outperforms existing methods by up to 1.37x in training throughput.

2. The approach overlaps recomputation with communication to reduce overhead.

3. It demonstrates significant improvements on GPT models with 1.3B-23B parameters.

Abstract

Large model training often uses recomputation to alleviate memory pressure and pipelines to exploit the parallelism of data, tensors, and devices. However, existing recomputation approaches may incur high overhead when training real-world models, as they are executed on demand in the critical training path. In this paper, we present Lynx, a new recomputation framework to reduce overhead by overlapping recomputation with communication in training pipelines. To reduce the large search space for recomputation strategies, we propose a heuristic-based recomputation scheduling algorithm, which is based on the observation that there are identical structures in large DNN models so that we can apply the same scheduling policy to all such structures. Additionally, we propose a recomputation-aware model partitioning method to balance each stage's execution time for improved training throughput.…
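To make the overlap idea concrete, here is a minimal toy cost model, not Lynx's actual scheduler. It assumes a single pipeline stage whose step consists of compute, communication, and recomputation, and it compares running recomputation on demand (serialized on the critical path) with hiding it behind the communication window. The function names and all timing numbers are hypothetical and are not taken from the paper.

```python
def on_demand_time(compute, comm, recompute):
    """Recomputation runs on the critical path, so all three phases serialize."""
    return compute + comm + recompute


def overlapped_time(compute, comm, recompute):
    """Recomputation is scheduled during communication; only the part that
    does not fit inside the communication window remains exposed."""
    exposed = max(0.0, recompute - comm)
    return compute + comm + exposed


# Hypothetical per-step timings in milliseconds (illustrative only).
compute_ms, comm_ms, recompute_ms = 10, 4, 3

baseline = on_demand_time(compute_ms, comm_ms, recompute_ms)
overlapped = overlapped_time(compute_ms, comm_ms, recompute_ms)
print(f"on demand: {baseline} ms, overlapped: {overlapped} ms")
```

In this toy model the benefit is bounded by the communication time: once recomputation takes longer than communication, the excess returns to the critical path. That bound is one way to see why deciding which activations to recompute, and when, is a scheduling problem.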

Taxonomy

Topics: Advanced Neural Network Applications · Gaussian Processes and Bayesian Inference · Neural Networks and Applications

Methods: Attention Is All You Need · Cosine Annealing · Byte Pair Encoding · Adam · Attention Dropout · Weight Decay · Linear Warmup With Cosine Annealing · Linear Layer · Multi-Head Attention