Blink: Fast and Generic Collectives for Distributed ML

Guanhua Wang; Shivaram Venkataraman; Amar Phanishayee; Jorgen Thelin; Nikhil Devanur; Ion Stoica

arXiv:1910.04940 · cs.DC · October 14, 2019 · 20 cites

Open Access

TL;DR

Blink is a new collective communication library that dynamically optimizes model parameter synchronization across GPUs, significantly reducing training time and improving efficiency in distributed machine learning.

Contribution

Blink dynamically generates optimal communication primitives by packing spanning trees, minimizes the number of trees generated, and leverages heterogeneous communication channels for faster data transfer.
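To make the spanning-tree idea concrete, here is a minimal sketch of an all-reduce that splits each GPU's buffer across several spanning trees and reduces each slice along its own tree. It is an illustration under stated assumptions, not Blink's implementation: the trees and weights are supplied by hand (the tree-generation step is not reproduced), GPUs are simulated as dictionary entries, and the names `tree_allreduce`, `split_by_weight`, and `multi_tree_allreduce` are hypothetical.

```python
from typing import Dict, List, Optional

# A spanning tree as a node -> parent map; the root's parent is None.
Tree = Dict[int, Optional[int]]


def tree_allreduce(values: Dict[int, float], tree: Tree) -> Dict[int, float]:
    """Sum one value per node up the tree, then broadcast the total back down."""
    children: Dict[int, List[int]] = {node: [] for node in tree}
    root = None
    for node, parent in tree.items():
        if parent is None:
            root = node
        else:
            children[parent].append(node)

    def reduce_up(node: int) -> float:
        # Each node adds its own value to the partial sums of its children.
        return values[node] + sum(reduce_up(c) for c in children[node])

    total = reduce_up(root)
    # Broadcast phase: every node ends up with the same reduced value.
    return {node: total for node in tree}


def split_by_weight(n: int, weights: List[float]) -> List[range]:
    """Split indices 0..n-1 into contiguous slices proportional to the weights."""
    total = sum(weights)
    bounds = [0]
    acc = 0.0
    for w in weights:
        acc += w
        bounds.append(round(n * acc / total))
    return [range(bounds[i], bounds[i + 1]) for i in range(len(weights))]


def multi_tree_allreduce(
    buffers: Dict[int, List[float]],
    trees: List[Tree],
    weights: List[float],
) -> Dict[int, List[float]]:
    """All-reduce equal-length buffers, sending each slice over a different tree.

    Assumption: the trees and their weights are given as input here.
    """
    n = len(next(iter(buffers.values())))
    out = {gpu: [0.0] * n for gpu in buffers}
    for tree, indices in zip(trees, split_by_weight(n, weights)):
        for i in indices:
            reduced = tree_allreduce({g: buffers[g][i] for g in buffers}, tree)
            for g, v in reduced.items():
                out[g][i] = v
    return out


if __name__ == "__main__":
    # Hypothetical 3-GPU example with two hand-picked spanning trees.
    buffers = {0: [1.0, 2.0], 1: [10.0, 20.0], 2: [100.0, 200.0]}
    trees = [{0: None, 1: 0, 2: 0}, {2: None, 1: 2, 0: 1}]
    print(multi_tree_allreduce(buffers, trees, [1.0, 1.0]))
```

In this sketch each tree carries a share of the data proportional to its weight, so a tree assumed to have more bandwidth would be given a larger weight. How the trees and weights are chosen is exactly the part that the paper's packing procedure addresses and that this example leaves out.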

Findings

1. Up to 8x faster model synchronization compared to NCCL.

2. Reduces end-to-end training time for image classification by up to 40%.

3. Leverages heterogeneous communication channels for faster data transfers.

Abstract

Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever-increasing hardware heterogeneity. To address this, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage heterogeneous communication channels for faster data transfers. Evaluations show that compared to the state-of-the-art (NCCL), Blink can achieve up to 8x faster model synchronization, and reduce end-to-end training time for image classification tasks by up to 40%.
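One way to see why using heterogeneous channels together can speed up a transfer: if data is divided across parallel channels in proportion to their bandwidths, all channels finish at the same time and the transfer completes faster than on the fastest channel alone. The sketch below works through that arithmetic. It is a simplified model, not Blink's method: it ignores latency and any per-channel overhead, and the channel names, bandwidth figures, and the functions `split_across_channels` and `transfer_time` are hypothetical.

```python
from typing import Dict


def split_across_channels(
    total_bytes: float, bandwidth: Dict[str, float]
) -> Dict[str, float]:
    """Give each channel a share of the data proportional to its bandwidth."""
    aggregate = sum(bandwidth.values())
    return {name: total_bytes * bw / aggregate for name, bw in bandwidth.items()}


def transfer_time(shares: Dict[str, float], bandwidth: Dict[str, float]) -> float:
    """Channels run in parallel, so the slowest channel sets the completion time."""
    return max(shares[name] / bandwidth[name] for name in shares)


if __name__ == "__main__":
    # Illustrative numbers only (units of data per second), not measurements.
    bandwidth = {"fast": 30.0, "slow": 10.0}
    total = 40.0

    shares = split_across_channels(total, bandwidth)
    both = transfer_time(shares, bandwidth)
    fast_only = transfer_time({"fast": total}, bandwidth)
    print(shares, both, fast_only)
```

Under this model the combined transfer time equals the data size divided by the sum of the bandwidths, so adding even a slow channel helps. A real system would also have to account for costs this model omits.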

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications

Methods: Blink Communication