Optimizing Large Model Training through Overlapped Activation Recomputation
Ping Chen, Wenjie Zhang, Shuibing He, Weijian Chen, Siling Yang, Kexin Huang, Yanlong Yin, Xuan Zhan, Yingjie Gu, Zhuwei Peng, Yi Zheng, Zhefeng Wang, Gang Chen

TL;DR
Lynx is a recomputation framework that overlaps activation recomputation with communication in large model training pipelines, reducing recomputation overhead and improving throughput for models with billions of parameters.
Contribution
The paper introduces Lynx, which combines a heuristic-based recomputation scheduling algorithm with a recomputation-aware model partitioning method to reduce recomputation overhead and balance the execution time of pipeline stages when training large neural networks.
Findings
Lynx outperforms existing methods by up to 1.37x in training throughput.
The approach effectively overlaps recomputation with communication to reduce overhead.
It demonstrates significant improvements on GPT models with 1.3B-23B parameters.
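To make the overlap idea concrete, here is a minimal toy cost model, not the scheduling algorithm from the paper. It assumes recomputation can run fully concurrently with communication, so only the part of recomputation that exceeds the communication time stays on the critical path. The function names and the millisecond costs are hypothetical and chosen only for illustration.

```python
def step_time_on_demand(compute, comm, recompute):
    """On-demand recomputation sits on the critical path, so all costs add up."""
    return compute + comm + recompute


def step_time_overlapped(compute, comm, recompute):
    """Recomputation is hidden behind communication; only the excess is exposed.

    Assumption: recomputation and communication can proceed fully in parallel.
    """
    exposed = max(0.0, recompute - comm)
    return compute + comm + exposed


# Hypothetical per-stage costs in milliseconds (illustrative, not measured).
compute_ms, comm_ms, recompute_ms = 100.0, 30.0, 40.0

on_demand = step_time_on_demand(compute_ms, comm_ms, recompute_ms)
overlapped = step_time_overlapped(compute_ms, comm_ms, recompute_ms)
speedup = on_demand / overlapped

print(f"on-demand: {on_demand} ms, overlapped: {overlapped} ms, speedup: {speedup:.2f}x")
```

In this toy setting the benefit is bounded by the communication time available to hide recomputation behind, which is one reason a scheduler has to choose which activations to recompute and when.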
Abstract
Large model training often uses recomputation to alleviate memory pressure and pipelines to exploit the parallelism of data, tensors, and devices. However, existing recomputation approaches may incur high overhead when training real-world models, as they are executed on demand in the critical training path. In this paper, we present Lynx, a new recomputation framework to reduce overhead by overlapping recomputation with communication in training pipelines. To reduce the large search space for recomputation strategies, we propose a heuristic-based recomputation scheduling algorithm, which is based on the observation that there are identical structures in large DNN models so that we can apply the same scheduling policy to all such structures. Additionally, we propose a recomputation-aware model partitioning method to balance each stage's execution time for improved training throughput.…
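The partitioning idea in the abstract can be sketched with a classic contiguous-partition dynamic program. This is a simplified illustration under stated assumptions, not the method from the paper: each layer has a fixed cost that already includes any recomputation time, and the goal is to split the layers into contiguous pipeline stages so that the slowest stage is as fast as possible. The function name `partition_layers` and the example costs are hypothetical.

```python
from itertools import accumulate


def partition_layers(layer_costs, num_stages):
    """Split layers into contiguous stages, minimizing the slowest stage's time.

    layer_costs[i] is assumed to include forward, backward, and any
    recomputation time for layer i (a simplification for illustration).
    Returns (bottleneck_time, [(start, end), ...]) with end exclusive.
    """
    n = len(layer_costs)
    prefix = [0.0] + list(accumulate(layer_costs))
    inf = float("inf")
    # best[k][i]: minimal bottleneck when the first i layers use k stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                # Last stage covers layers j..i-1; earlier stages cover 0..j-1.
                cost = max(best[k - 1][j], prefix[i] - prefix[j])
                if cost < best[k][i]:
                    best[k][i] = cost
                    cut[k][i] = j
    # Walk the recorded cut points backwards to recover stage boundaries.
    stages = []
    i = n
    for k in range(num_stages, 0, -1):
        j = cut[k][i]
        stages.append((j, i))
        i = j
    stages.reverse()
    return best[num_stages][n], stages


# Hypothetical costs: the first three layers carry extra recomputation time.
base = [3.0] * 6
recompute = [2.0, 2.0, 2.0, 0.0, 0.0, 0.0]
costs = [b + r for b, r in zip(base, recompute)]

bottleneck, stages = partition_layers(costs, num_stages=2)
print(bottleneck, stages)
```

With these example costs, an even 3/3 split that ignores recomputation would leave the first stage at 15 time units, while the cost-aware split places two layers in the first stage and four in the second for a bottleneck of 14.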
Taxonomy
Topics: Advanced Neural Network Applications · Gaussian Processes and Bayesian Inference · Neural Networks and Applications
Methods: Attention Is All You Need · Cosine Annealing · Byte Pair Encoding · Adam · Attention Dropout · Weight Decay · Linear Warmup With Cosine Annealing · Linear Layer · Multi-Head Attention
