Algorithm to Compilation Co-design: An Integrated View of Neural Network Sparsity

Fu-Ming Guo; Austin Huang

arXiv:2106.08846·cs.LG·June 18, 2021

TL;DR

This paper explores how pruning-induced sparsity in the attention weights of BERT translates into faster inference when the pruning algorithm and the compiler (TVM with block sparse representation support) are designed together.

Contribution

It introduces an integrated approach combining pruning algorithms with compiler support, demonstrating substantial runtime speedups and insights into optimal sparsity patterns for BERT.

Findings

01. 4x speedup with BSR support in TVM

02. Optimal block shape for BERT attention is 32x1

03. Performance depends on block size and regularization parameters

Abstract

Reducing computation cost, inference latency, and memory footprint of neural networks are frequently cited as research motivations for pruning and sparsity. However, operationalizing those benefits and understanding the end-to-end effect of algorithm design and regularization on the runtime execution is not often examined in depth. Here we apply structured and unstructured pruning to attention weights of transformer blocks of the BERT language model, while also expanding block sparse representation (BSR) operations in the TVM compiler. Integration of BSR operations enables the TVM runtime execution to leverage structured pattern sparsity induced by model regularization. This integrated view of pruning algorithms enables us to study relationships between modeling decisions and their direct impact on sparsity-enhanced execution. Our main findings are: 1) we validate that performance…
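To make the block sparse representation (BSR) idea concrete, the following is a minimal NumPy sketch under stated assumptions. It is not the paper's method or the TVM API: the paper induces block structure through model regularization, whereas this sketch swaps in simple magnitude-based block pruning, and the helper names `prune_blocks`, `to_bsr`, and `bsr_matmul`, the matrix sizes, and the keep fraction are all hypothetical. It only illustrates how a matrix pruned in 32x1 blocks can be stored as non-zero blocks plus index arrays and multiplied without touching the pruned blocks.

```python
import numpy as np

def prune_blocks(weight, block_shape, keep_fraction):
    """Zero every block except the highest-L2-norm ones (illustrative magnitude pruning)."""
    rows, cols = weight.shape
    br, bc = block_shape
    assert rows % br == 0 and cols % bc == 0
    # View the matrix as a grid of (br x bc) blocks: index [i, a, j, b] = weight[i*br+a, j*bc+b].
    blocks = weight.reshape(rows // br, br, cols // bc, bc)
    norms = np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    n_keep = max(1, int(round(keep_fraction * norms.size)))
    threshold = np.sort(norms, axis=None)[-n_keep]
    mask = norms >= threshold
    return (blocks * mask[:, None, :, None]).reshape(rows, cols)

def to_bsr(weight, block_shape):
    """Return (data, indices, indptr), storing only blocks that contain a non-zero entry."""
    rows, cols = weight.shape
    br, bc = block_shape
    blocks = weight.reshape(rows // br, br, cols // bc, bc).transpose(0, 2, 1, 3)
    nonzero = np.any(blocks != 0, axis=(2, 3))        # one flag per block
    block_row, block_col = np.nonzero(nonzero)        # row-major order
    data = blocks[block_row, block_col]               # shape (n_blocks, br, bc)
    indptr = np.concatenate(([0], np.cumsum(nonzero.sum(axis=1))))
    return data, block_col, indptr

def bsr_matmul(data, indices, indptr, x):
    """Multiply a BSR matrix by a dense matrix, visiting only the stored blocks."""
    br, bc = data.shape[1:]
    n_block_rows = len(indptr) - 1
    out = np.zeros((n_block_rows * br, x.shape[1]), dtype=x.dtype)
    for i in range(n_block_rows):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            out[i * br:(i + 1) * br] += data[k] @ x[j * bc:(j + 1) * bc]
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))                    # hypothetical weight matrix
W_pruned = prune_blocks(W, block_shape=(32, 1), keep_fraction=0.25)
data, indices, indptr = to_bsr(W_pruned, block_shape=(32, 1))
x = rng.standard_normal((64, 8))
y = bsr_matmul(data, indices, indptr, x)
```

The Python loop here is only for readability. A compiler backend would generate a scheduled kernel for the same traversal, which is where block shape can interact with scheduling decisions.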

Taxonomy

Topics: Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices · Adversarial Robustness in Machine Learning

Methods: Attention Is All You Need · Pruning · Linear Layer · Adam · Layer Normalization · Multi-Head Attention · Linear Warmup With Linear Decay · Residual Connection · WordPiece