TL;DR
This paper introduces optimized GPU kernels and multi-GPU strategies for sparse deep neural network inference, achieving significant speedups and efficiency improvements over previous methods and champion solutions.
Contribution
It presents sparse matrix multiplication kernels fused with the ReLU function, and a multi-GPU parallelization approach that duplicates weights and statically partitions the feature maps across GPUs.
Findings
Up to 180 tera-edges/sec inference throughput.
4.3x faster single-GPU performance than the 2019 champion.
2.37x throughput improvement on NVIDIA A100 over V100.
Abstract
This paper presents GPU performance optimization and scaling results for inference models of the Sparse Deep Neural Network Challenge 2020. Demands for network quality have increased rapidly, pushing the size and thus the memory requirements of many neural networks beyond the capacity of available accelerators. Sparse deep neural networks (SpDNN) have shown promise for reining in the memory footprint of large neural networks. However, there is room for improvement in implementing SpDNN operations on GPUs. This work presents optimized sparse matrix multiplication kernels fused with the ReLU function. The optimized kernels reuse input feature maps from shared memory and sparse weights from registers. For multi-GPU parallelism, our SpDNN implementation duplicates weights and statically partitions the feature maps across GPUs. Results for the challenge benchmarks show that the proposed…
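The two ideas in the abstract, fusing ReLU into the sparse multiplication and statically partitioning feature maps while every device holds all weights, can be illustrated with a small CPU-only sketch. This is not the paper's GPU kernel: it assumes a CSR weight layout and a neurons-by-inputs feature layout, the names `fused_spmm_relu`, `partition_columns`, and `infer_partitioned` are hypothetical, the `bias` parameter is an added assumption, and shared-memory or register reuse is not modeled.

```python
from typing import List, Tuple

# Assumed layout: CSR weights as (row_ptr, col_idx, values).
CSR = Tuple[List[int], List[int], List[float]]
Matrix = List[List[float]]  # rows = neurons, columns = input samples


def fused_spmm_relu(weights: CSR, features: Matrix, bias: float = 0.0) -> Matrix:
    """Compute relu(W @ features + bias) in one pass over the nonzeros of W."""
    row_ptr, col_idx, values = weights
    batch = len(features[0]) if features else 0
    out: Matrix = []
    for i in range(len(row_ptr) - 1):
        acc = [bias] * batch
        for p in range(row_ptr[i], row_ptr[i + 1]):
            w = values[p]
            src = features[col_idx[p]]
            for j in range(batch):
                acc[j] += w * src[j]
        # ReLU is applied before the row is stored, so no separate pass is needed.
        out.append([a if a > 0.0 else 0.0 for a in acc])
    return out


def partition_columns(batch: int, n_devices: int) -> List[Tuple[int, int]]:
    """Split `batch` columns into contiguous, near-equal static chunks."""
    base, extra = divmod(batch, n_devices)
    bounds, start = [], 0
    for d in range(n_devices):
        size = base + (1 if d < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def infer_partitioned(layers: List[CSR], features: Matrix, n_devices: int) -> Matrix:
    """Run all layers on each column chunk; every 'device' uses the full weights."""
    parts = []
    for lo, hi in partition_columns(len(features[0]), n_devices):
        local = [row[lo:hi] for row in features]
        for w in layers:  # weights are duplicated, so no cross-device exchange
            local = fused_spmm_relu(w, local)
        parts.append(local)
    # Concatenate the per-device column chunks back into one matrix.
    return [[v for part in parts for v in part[r]] for r in range(len(parts[0]))]
```

Because each input sample occupies its own column and each layer acts on columns independently, the chunks never need to exchange data between layers, which is why a static split suffices in this sketch.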
