Faster Multi-GPU Training with PPLL: A Pipeline Parallelism Framework Leveraging Local Learning

Xiuyuan Guo (1), Chengqi Xu (1), Guinan Guo (3), Feiyu Zhu (4), Changpeng Cai (5), Peizhe Wang (5), Xiaoming Wei (2), Junhao Su (2), Jialin Gao (2) ((1) University of Southern California, (2) Meituan, (3) Sun Yat-sen University, (4) University of Shanghai for Science and Technology, (5) Southeast University)

arXiv:2411.12780·cs.CV·November 21, 2024

TL;DR

This paper introduces PPLL, a pipeline parallelism framework based on local learning, which improves multi-GPU training efficiency by reducing communication overhead and synchronization delays.

Contribution

PPLL combines local learning with queue-based data transfer between GPUs so that model blocks train in a pipelined fashion, reaching training speed comparable to or better than traditional pipeline parallelism.

Findings

01. On 4 GPUs, PPLL accelerates local learning training of ViT by 162%.

02. On 4 GPUs, PPLL accelerates local learning training of ResNet by 33%.

03. PPLL achieves training speed comparable to or better than traditional pipeline parallelism.

Abstract

Currently, training large-scale deep learning models is typically achieved through parallel training across multiple GPUs. However, due to the inherent communication overhead and synchronization delays in traditional model parallelism methods, seamless parallel training cannot be achieved, which, to some extent, affects overall training efficiency. To address this issue, we present PPLL (Pipeline Parallelism based on Local Learning), a novel framework that leverages local learning algorithms to enable effective parallel training across multiple GPUs. PPLL divides the model into several distinct blocks, each allocated to a separate GPU. By utilizing queues to manage data transfers between GPUs, PPLL ensures seamless cross-GPU communication, allowing multiple blocks to execute forward and backward passes in a pipelined manner. This design minimizes idle times and prevents bottlenecks…
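
The queue-based pipelining described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: CPU threads stand in for GPUs, a scalar weight stands in for each model block, and the squared-error local loss is a placeholder. The names `LocalBlock`, `stage_worker`, and `run_pipeline` are hypothetical. What the sketch tries to show is that each block updates itself from its own local loss and hands its output to the next block through a queue, so no gradient travels back upstream and different blocks can work on different batches at the same time.

```python
import queue
import threading

SENTINEL = None  # marks the end of the batch stream


class LocalBlock:
    """Toy stand-in for one model block with its own local loss (hypothetical)."""

    def __init__(self, weight, lr=0.05):
        self.weight = weight
        self.lr = lr
        self.steps = 0

    def train_step(self, x, target):
        y = self.weight * x              # forward pass through this block only
        grad = (y - target) * x          # d/dw of the local loss 0.5 * (y - target)**2
        self.weight -= self.lr * grad    # local update; nothing is sent upstream
        self.steps += 1
        return y                         # passed downstream as plain data


def stage_worker(block, in_q, out_q):
    """Consume batches from in_q, train the block locally, forward outputs to out_q."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)          # propagate shutdown to the next stage
            return
        x, target = item
        out_q.put((block.train_step(x, target), target))


def run_pipeline(blocks, batches, queue_size=2):
    # Bounded queues sit between stages; the final queue is unbounded so the
    # last stage never blocks while the caller is still feeding inputs.
    queues = [queue.Queue(maxsize=queue_size) for _ in blocks] + [queue.Queue()]
    workers = [
        threading.Thread(target=stage_worker, args=(b, queues[i], queues[i + 1]))
        for i, b in enumerate(blocks)
    ]
    for w in workers:
        w.start()
    for batch in batches:
        queues[0].put(batch)
    queues[0].put(SENTINEL)
    for w in workers:
        w.join()
    outputs = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            return outputs
        outputs.append(item)


if __name__ == "__main__":
    blocks = [LocalBlock(0.5), LocalBlock(0.5)]
    batches = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5)] * 100
    run_pipeline(blocks, batches)
    print([round(b.weight, 3) for b in blocks])
```

In this toy setup the bounded queues play the role of the inter-GPU buffers: a fast stage waits only when its downstream queue is full, rather than waiting for a backward pass to return from later stages.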

Taxonomy

Topics: Machine Learning and ELM · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications

Methods: Attention Is All You Need · Dense Connections · Label Smoothing · Average Pooling · Adam · Residual Connection · Byte Pair Encoding · Convolution · Global Average Pooling · Kaiming Initialization