Pipe-BD: Pipelined Parallel Blockwise Distillation
Hongsun Jang, Jaewon Jung, Jaeyong Song, Joonsang Yu, Youngsok Kim, and Jinho Lee

TL;DR
Pipe-BD introduces a pipeline parallelism approach to accelerate blockwise distillation of large neural networks, reducing redundant computations and improving resource utilization without altering the core distillation process.
Contribution
The paper proposes Pipe-BD, a novel parallelization technique that speeds up blockwise distillation through pipeline and hybrid parallelism, addressing the redundant teacher execution, low GPU utilization, and extra data loading of existing data-parallel methods.
Findings
Significant acceleration in training time across multiple models and datasets.
Improved GPU utilization and resource efficiency.
Effective workload balancing with hybrid parallelism.
Abstract
Training large deep neural network models is highly challenging due to their tremendous computational and memory requirements. Blockwise distillation provides one promising method towards faster convergence by splitting a large model into multiple smaller models. In state-of-the-art blockwise distillation methods, training is performed block-by-block in a data-parallel manner using multiple GPUs. To produce inputs for the student blocks, the teacher model is executed from the beginning until the current block under training. However, this results in a high overhead of redundant teacher execution, low GPU utilization, and extra data loading. To address these problems, we propose Pipe-BD, a novel parallelization method for blockwise distillation. Pipe-BD aggressively utilizes pipeline parallelism for blockwise distillation, eliminating redundant teacher block execution and increasing…
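The redundancy described in the abstract can be illustrated with a small counting sketch. This is not the authors' implementation: `CountingBlock`, `baseline_pairs`, and `pipelined_pairs` are hypothetical names, each "teacher block" is a toy scalar multiplication, and the pipelined variant is simulated sequentially on one process rather than across GPUs. It only shows that rerunning the teacher from the beginning for every block costs n(n+1)/2 teacher block executions per batch for n blocks, while forwarding activations from one block to the next costs n, with identical student inputs and targets.

```python
from typing import List, Tuple


class CountingBlock:
    """Toy stand-in for one teacher block; counts how often it is executed."""

    def __init__(self, scale: float) -> None:
        self.scale = scale
        self.calls = 0

    def __call__(self, x: float) -> float:
        self.calls += 1
        return x * self.scale


def baseline_pairs(teacher: List[CountingBlock], x: float) -> List[Tuple[float, float]]:
    """Block-by-block scheme: for block i, rerun teacher blocks 0..i-1 from the input."""
    pairs = []
    for i in range(len(teacher)):
        h = x
        for block in teacher[:i]:   # redundant re-execution of earlier blocks
            h = block(h)
        target = teacher[i](h)      # teacher output the student block should match
        pairs.append((h, target))
    return pairs


def pipelined_pairs(teacher: List[CountingBlock], x: float) -> List[Tuple[float, float]]:
    """Pipelined scheme (simulated): each block runs once and passes its output on."""
    pairs = []
    h = x
    for block in teacher:
        target = block(h)
        pairs.append((h, target))
        h = target                  # activation handed to the next pipeline stage
    return pairs


def total_calls(teacher: List[CountingBlock]) -> int:
    return sum(block.calls for block in teacher)


if __name__ == "__main__":
    scales = [2.0, 3.0, 5.0, 7.0]   # arbitrary toy values, not from the paper
    base_teacher = [CountingBlock(s) for s in scales]
    pipe_teacher = [CountingBlock(s) for s in scales]

    same = baseline_pairs(base_teacher, 1.0) == pipelined_pairs(pipe_teacher, 1.0)
    print("same (input, target) pairs:", same)
    print("baseline teacher block runs:", total_calls(base_teacher))
    print("pipelined teacher block runs:", total_calls(pipe_teacher))
```

The sketch treats all blocks as equally expensive; with uneven block costs the pipeline stages would be unbalanced, which is the kind of imbalance the summary's hybrid parallelism finding refers to.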
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques
