Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization
H. T. Kung, Bradley McDanel, Sai Qian Zhang

TL;DR
This paper introduces column combining, a method for packing sparse CNNs into denser filter matrices so that they run efficiently on systolic arrays. It reports roughly 4x higher array utilization, along with gains in energy efficiency and inference latency, while preserving accuracy through retraining on only a fraction of the original dataset.
Contribution
The paper proposes a column combining technique that raises systolic array utilization by packing sparse filter matrices, and it jointly optimizes pruning and retraining so that the packed network keeps high classification accuracy.
Findings
Approximately 4x increase in array utilization.
3x improvement in energy efficiency.
12x reduction in inference latency.
Abstract
This paper describes a novel approach of packing sparse convolutional neural networks for their efficient systolic array implementations. By combining subsets of columns in the original filter matrix associated with a convolutional layer, we increase the utilization efficiency of the systolic array substantially (e.g., ~4x) due to the increased density of nonzeros in the resulting packed filter matrix. In combining columns, for each row, all filter weights but one with the largest magnitude are pruned. We retrain the remaining weights to preserve high accuracy. We demonstrate that in mitigating data privacy concerns the retraining can be accomplished with only fractions of the original dataset (e.g., 10% for CIFAR-10). We study the effectiveness of this joint optimization for both high utilization and classification accuracy with ASIC and FPGA designs based on efficient bit-serial…
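The combining rule stated in the abstract (within each group of combined columns, every row keeps only its largest-magnitude weight) can be illustrated with a minimal sketch. The function names `combine_columns` and `density`, the toy matrix `W`, and the hand-picked column grouping are all assumptions made for illustration; how the paper selects which columns to combine, and how it retrains afterwards, is not shown here.

```python
def combine_columns(filter_matrix, groups):
    """Pack each group of columns into a single column.

    For every row, only the largest-magnitude weight within the group
    survives; the other weights in that group are pruned (dropped).
    Returns the packed matrix and, per entry, the original column index
    the surviving weight came from (None if the whole group was zero).
    """
    n_rows = len(filter_matrix)
    packed = [[0.0] * len(groups) for _ in range(n_rows)]
    source = [[None] * len(groups) for _ in range(n_rows)]
    for g, cols in enumerate(groups):
        for r in range(n_rows):
            # Pick the column in this group with the largest |weight| on row r.
            best = max(cols, key=lambda c: abs(filter_matrix[r][c]))
            weight = filter_matrix[r][best]
            if weight != 0:
                packed[r][g] = weight
                source[r][g] = best
    return packed, source


def density(matrix):
    """Fraction of entries that are nonzero."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for w in row if w != 0)
    return nonzero / total


# Hypothetical sparse 4x4 filter matrix (values chosen for illustration).
W = [
    [0.0, 0.5, 0.0, -0.2],
    [0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, -0.9, 0.1],
    [0.0, 0.3, 0.0, 0.0],
]
# Assumed grouping: columns 0 and 1 are combined, as are columns 2 and 3.
groups = [[0, 1], [2, 3]]

packed, source = combine_columns(W, groups)
print(packed)
print(density(W), density(packed))
```

In this toy case the packed matrix has half as many columns, and only one weight (the 0.1 in the third row, which conflicts with -0.9) is pruned. The `source` matrix records which original column each surviving weight belongs to, which any implementation would need in order to pair each weight with the correct input.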
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning and ELM · Advanced Memory and Neural Computing
