Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs
Xueying Wang, Guangli Li, Xiao Dong, Jiansong Li, Lei Liu, and Xiaobing Feng

TL;DR
This paper introduces a GPU-based layer fusion technique for CNNs that increases cross-layer data reuse and reduces inference time, achieving an average 2.02x speedup on CNN structures and a 1.57x speedup on end-to-end SqueezeNet inference.
Contribution
It proposes three fusion modes for CNN layers (straight, merge, and split) and an approach for generating efficient fused code that exploits cross-layer data reuse across the GPU memory hierarchy.
Findings
Average speedup of 2.02x on CNN structures
1.57x speedup on end-to-end SqueezeNet inference
Effective utilization of multi-level memory hierarchy
Abstract
Accelerating the deep learning inference is very important for real-time applications. In this paper, we propose a novel method to fuse the layers of convolutional neural networks (CNNs) on Graphics Processing Units (GPUs), which applies data reuse analysis and access optimization in different levels of the memory hierarchy. To achieve the balance between computation and memory access, we explore the fusion opportunities in the CNN computation graph and propose three fusion modes of convolutional neural networks: straight, merge and split. Then, an approach for generating efficient fused code is designed, which goes deeper in multi-level memory usage for cross-layer data reuse. The effectiveness of our method is evaluated with the network layers from state-of-the-art CNNs on two different GPU platforms, NVIDIA TITAN Xp and Tesla P4. The experiments show that the average speedup is 2.02x…
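The idea of fusing consecutive layers so that intermediate data is reused on-chip can be illustrated with a small sketch. This is not the paper's generated GPU code: it assumes 1D "valid" convolutions as stand-ins for CNN layers, a Python list as a stand-in for on-chip memory, and hypothetical function names (`conv1d`, `unfused`, `fused`) and a hypothetical `tile` parameter. It only shows a straight chain of two layers computed tile by tile, so that the full intermediate feature map is never materialized.

```python
# Illustrative sketch only: 1D convolutions stand in for CNN layers, and a
# short Python list stands in for on-chip memory (shared memory / registers).

def conv1d(x, w):
    """Valid 1D convolution (cross-correlation) of list x with kernel w."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]

def unfused(x, w1, w2):
    """Layer by layer: the whole intermediate feature map is materialized."""
    mid = relu(conv1d(x, w1))  # on a GPU this would round-trip through global memory
    return conv1d(mid, w2)

def fused(x, w1, w2, tile=4):
    """Straight-chain fusion: produce the output in tiles and keep only a
    small intermediate tile live at any time."""
    k1, k2 = len(w1), len(w2)
    n_out = len(x) - k1 - k2 + 2
    out = []
    for start in range(0, n_out, tile):
        stop = min(start + tile, n_out)
        # Outputs [start, stop) need intermediates [start, stop + k2 - 1),
        # which in turn need inputs [start, stop + k1 + k2 - 2).
        x_tile = x[start: stop + k1 + k2 - 2]
        mid_tile = relu(conv1d(x_tile, w1))  # stays in the "on-chip" buffer
        out.extend(conv1d(mid_tile, w2))
    return out

if __name__ == "__main__":
    x = [1.0, -2.0, 3.0, 0.5, -1.0, 2.0, 4.0, -3.0, 1.5, 0.0]
    w1 = [1.0, -1.0, 0.5]
    w2 = [0.5, 2.0]
    print(fused(x, w1, w2) == unfused(x, w1, w2))
```

In this sketch, neighbouring tiles recompute the few intermediate elements they share at their borders. That trade of extra computation for fewer memory accesses is one instance of the balance between computation and memory access that the abstract refers to.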
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
