TL;DR
ZNN introduces a parallel algorithm for training 3D convolutional networks that achieves near-linear speedup on multi-core and many-core shared memory machines, making ConvNet training faster and more scalable.
Contribution
The paper presents a novel task-based parallel algorithm for ConvNet training that attains near-linear speedup on shared-memory architectures, with an efficient implementation called ZNN.
Findings
ZNN achieves roughly linear speedup with the number of CPU cores.
ZNN achieves over 90x speedup on a many-core Xeon Phi CPU.
Performance varies with network architecture and kernel sizes.
Abstract
Convolutional networks (ConvNets) have become a popular approach to computer vision. It is important to accelerate ConvNet training, which is computationally costly. We propose a novel parallel algorithm based on decomposition into a set of tasks, most of which are convolutions or FFTs. Applying Brent's theorem to the task dependency graph implies that linear speedup with the number of processors is attainable within the PRAM model of parallel computation, for wide network architectures. To attain such performance on real shared-memory machines, our algorithm computes convolutions converging on the same node of the network with temporal locality to reduce cache misses, and sums the convergent convolution outputs via an almost wait-free concurrent method to reduce time spent in critical sections. We implement the algorithm with a publicly available software package called ZNN.…
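The appeal to Brent's theorem can be made concrete with the standard form of the bound. In the sketch below, the symbols T_1, T_\infty, T_P, and S_P are notation introduced here for illustration; they are not taken from the abstract.

```latex
% Brent's theorem (standard form), applied to a task dependency graph.
% Assumed notation: T_1 = total work (running time on one processor),
% T_\infty = length of the longest chain of dependent tasks,
% T_P = running time on P processors.
\[
  T_P \;\le\; \frac{T_1}{P} + T_\infty
\]
% Resulting lower bound on the speedup S_P:
\[
  S_P \;=\; \frac{T_1}{T_P}
      \;\ge\; \frac{T_1}{T_1/P + T_\infty}
      \;=\; \frac{P}{1 + P\,T_\infty / T_1}
\]
```

When T_1 / T_\infty is much larger than P, the denominator is close to 1 and the speedup approaches P. This is consistent with the abstract's restriction to wide network architectures: widening a layer adds convolution tasks that do not depend on one another, which plausibly increases T_1 much faster than T_\infty.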
Taxonomy
Methods: Convolution
