Algorithm to Compilation Co-design: An Integrated View of Neural Network Sparsity
Fu-Ming Guo, Austin Huang

TL;DR
This paper studies how pruning the attention weights of BERT, combined with expanded block sparse representation (BSR) support in the TVM compiler, translates into faster inference when algorithm and compiler are designed together.
Contribution
It introduces an integrated approach that combines pruning algorithms with compiler support (expanded BSR operations in TVM), and uses it to measure runtime speedups and identify effective sparsity patterns for BERT.
Findings
4x speedup with BSR support in TVM
Optimal block shape for BERT attention is 32x1
Performance depends on block size and regularization parameters
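To make the block-shape finding concrete, the following is a minimal sketch of magnitude-based block pruning with a 32x1 block, followed by conversion to a BSR matrix. It uses NumPy and SciPy rather than TVM, and the helper name `prune_blocks`, the 768x768 weight shape, and the 75% sparsity level are illustrative assumptions, not the paper's setup or its pruning method.

```python
import numpy as np
from scipy.sparse import bsr_matrix


def prune_blocks(weight, block_shape=(32, 1), sparsity=0.75):
    """Zero out the lowest-norm blocks of `weight` (hypothetical helper)."""
    rows, cols = weight.shape
    br, bc = block_shape
    assert rows % br == 0 and cols % bc == 0, "shape must tile evenly"
    # View the matrix as a grid of (br x bc) blocks.
    blocks = weight.reshape(rows // br, br, cols // bc, bc)
    # L2 norm of each block, shape (rows // br, cols // bc).
    norms = np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    # Mark the n_prune blocks with the smallest norms for removal.
    n_prune = int(sparsity * norms.size)
    order = np.argsort(norms, axis=None)
    mask = np.ones(norms.size, dtype=bool)
    mask[order[:n_prune]] = False
    mask = mask.reshape(norms.shape)
    # Broadcast the per-block mask back over every element of each block.
    return (blocks * mask[:, None, :, None]).reshape(rows, cols)


rng = np.random.default_rng(0)
# Assumed shape: one 768 x 768 projection matrix, filled with random values.
dense = rng.standard_normal((768, 768)).astype(np.float32)
pruned = prune_blocks(dense, block_shape=(32, 1), sparsity=0.75)

# BSR stores only the blocks that contain at least one nonzero value.
sparse = bsr_matrix(pruned, blocksize=(32, 1))

# The sparse product matches the dense product on the pruned weights.
x = rng.standard_normal((768, 8)).astype(np.float32)
y_sparse = sparse @ x
y_dense = pruned @ x
```

Changing `block_shape` to a square block such as `(16, 16)` changes how the surviving weights are grouped in memory. That grouping is the kind of modeling decision whose runtime effect the paper measures.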
Abstract
Reducing computation cost, inference latency, and memory footprint of neural networks are frequently cited as research motivations for pruning and sparsity. However, operationalizing those benefits and understanding the end-to-end effect of algorithm design and regularization on the runtime execution is not often examined in depth. Here we apply structured and unstructured pruning to attention weights of transformer blocks of the BERT language model, while also expanding block sparse representation (BSR) operations in the TVM compiler. Integration of BSR operations enables the TVM runtime execution to leverage structured pattern sparsity induced by model regularization. This integrated view of pruning algorithms enables us to study relationships between modeling decisions and their direct impact on sparsity-enhanced execution. Our main findings are: 1) we validate that performance…
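As an illustration of how regularization can induce block-structured sparsity, here is a minimal sketch of a group lasso penalty over weight blocks and its proximal step (block soft-thresholding). The visible part of the abstract does not say which regularizer the authors use, so group lasso is an assumption here. The function names, the 64x4 matrix, and the value of `lam` are hypothetical.

```python
import numpy as np


def block_norms(weight, block_shape):
    """L2 norm of every (br x bc) block of `weight`."""
    rows, cols = weight.shape
    br, bc = block_shape
    blocks = weight.reshape(rows // br, br, cols // bc, bc)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))


def group_lasso_penalty(weight, block_shape, lam):
    """Penalty term: lam times the sum of block norms."""
    return lam * block_norms(weight, block_shape).sum()


def group_lasso_prox(weight, block_shape, lam):
    """Block soft-thresholding: shrink each block's norm by lam, flooring at zero."""
    rows, cols = weight.shape
    br, bc = block_shape
    norms = block_norms(weight, block_shape)
    # Blocks with norm <= lam get a scale of exactly 0 and vanish as a unit.
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    blocks = weight.reshape(rows // br, br, cols // bc, bc)
    return (blocks * scale[:, None, :, None]).reshape(rows, cols)


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 4))
lam = 5.5  # assumed value, close to the typical norm of a 32-element block

w_shrunk = group_lasso_prox(w, block_shape=(32, 1), lam=lam)
before = block_norms(w, (32, 1))
after = block_norms(w_shrunk, (32, 1))
```

Because the penalty acts on whole blocks, weights are removed in the same units that a BSR kernel stores. This is why the block shape chosen at training time also affects how the compiled model executes.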
Taxonomy
Topics: Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices · Adversarial Robustness in Machine Learning
Methods: Attention Is All You Need · Pruning · Linear Layer · Adam · Layer Normalization · Multi-Head Attention · Linear Warmup With Linear Decay · Residual Connection · WordPiece
