Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models

Yuliang Liu; Shenggui Li; Jiarui Fang; Yanjun Shao; Boyuan Yao; Yang You

arXiv:2302.02599·cs.LG·February 23, 2023·6 cites

TL;DR

Colossal-Auto presents a unified system that jointly optimizes parallelization strategies and activation checkpointing for large-scale model training, improving efficiency and reducing manual effort.

Contribution

It introduces a novel approach to jointly optimize distributed execution and checkpointing plans, along with a symbolic profiler for memory and compute estimation.

Findings

1. Joint optimization improves training efficiency

2. Symbolic profiler enables quick memory and compute estimation

3. Open-source implementation available for easy adoption

Abstract

In recent years, large-scale models have demonstrated state-of-the-art performance across various domains. However, training such models requires various techniques to address the problem of limited computing power and memory on devices such as GPUs. Some commonly used techniques include pipeline parallelism, tensor parallelism, and activation checkpointing. While existing works have focused on finding efficient distributed execution plans (Zheng et al. 2022) and activation checkpoint scheduling (Herrmann et al. 2019, Beaumont et al. 2021), there has been no method proposed to optimize these two plans jointly. Moreover, ahead-of-time compilation relies heavily on accurate memory and computing overhead estimation, which is often time-consuming and misleading. Existing training systems and machine learning pipelines either physically execute each operand or estimate memory usage with a…
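
The memory/compute trade-off behind activation checkpointing, which the abstract names as one of the two plans to optimize, can be illustrated with a toy cost model. The sketch below is not Colossal-Auto's algorithm or its symbolic profiler. It assumes a linear chain of layers, uniform fixed-size segments, and per-layer activation sizes and forward costs that are already known. The function name `checkpoint_cost` and its parameters are hypothetical.

```python
from typing import List, Tuple


def checkpoint_cost(
    act_bytes: List[int], fwd_flops: List[int], segment: int
) -> Tuple[int, int]:
    """Toy estimate of (peak_activation_bytes, recomputed_flops).

    Assumed model (illustrative only): layers form a linear chain and are
    grouped into consecutive segments of `segment` layers. The output of each
    segment's last layer is kept as a checkpoint; the other activations in a
    segment are dropped after the forward pass and recomputed, one segment at
    a time, during the backward pass.
    """
    if segment < 1:
        raise ValueError("segment must be >= 1")
    if len(act_bytes) != len(fwd_flops):
        raise ValueError("act_bytes and fwd_flops must have the same length")

    kept = 0               # bytes held by checkpoints for the whole step
    largest_interior = 0   # largest set of recomputed activations alive at once
    recomputed = 0         # forward work that has to be repeated

    for start in range(0, len(act_bytes), segment):
        chunk_act = act_bytes[start:start + segment]
        chunk_flops = fwd_flops[start:start + segment]
        kept += chunk_act[-1]
        largest_interior = max(largest_interior, sum(chunk_act[:-1]))
        recomputed += sum(chunk_flops[:-1])

    return kept + largest_interior, recomputed


if __name__ == "__main__":
    acts = [100] * 16   # hypothetical activation size per layer, in bytes
    flops = [10] * 16   # hypothetical forward cost per layer
    for seg in (1, 2, 4, 8):
        print(seg, checkpoint_cost(acts, flops, seg))
```

With `segment=1` every activation is kept and nothing is recomputed, which serves as the no-checkpointing baseline. Under this model, larger segments trade fewer resident checkpoints for more recomputation and a larger per-segment working set, so peak memory is lowest at an intermediate segment size.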

Code & Models

Repositories

hpcaitech/colossalai (PyTorch, official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Age of Information Optimization

Methods: Gradient Checkpointing · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings