Breaking BERT: Evaluating and Optimizing Sparsified Attention
Siddhartha Brahma, Polina Zablotskaia, David Mimno

TL;DR
This paper investigates the effects of sparsifying attention in transformers. It finds that attention can be at least 78% sparse with little performance loss when the sparsity is applied only at later layers, and it introduces an algorithm to optimize sparsity patterns.
Contribution
The paper evaluates sparsification patterns in transformers (masks based on syntax, lexical similarity, and token position, compared against random connections), shows that targeted sparsity maintains performance, and proposes a learnable sparsity method for better efficiency.
Findings
Attention that is at least 78% sparse can be tolerated when applied at later layers
Connections to neighboring tokens are the most important
Learned sparsity comes close to existing performance
Abstract
Transformers allow attention between all pairs of tokens, but there is reason to believe that most of these connections - and their quadratic time and memory - may not be necessary. But which ones? We evaluate the impact of sparsification patterns with a series of ablation experiments. First, we compare masks based on syntax, lexical similarity, and token position to random connections, and measure which patterns reduce performance the least. We find that on three common finetuning tasks even using attention that is at least 78% sparse can have little effect on performance if applied at later transformer layers, but that applying sparsity throughout the network reduces performance significantly. Second, we vary the degree of sparsity for three patterns supported by previous work, and find that connections to neighbouring tokens are the most significant. Finally, we treat sparsity as an…
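As a rough illustration of the neighboring-token pattern the abstract describes, the sketch below builds a fixed-window attention mask, measures how sparse it is, and applies it inside a softmax. It is not the authors' implementation: the function names, the window size, and the toy attention scores are all assumptions made for this example.

```python
import math

def neighbor_mask(seq_len, window):
    """mask[i][j] is True iff token i may attend to token j (|i - j| <= window)."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]

def mask_sparsity(mask):
    """Fraction of token pairs that the mask removes."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / total

def masked_attention_weights(scores, mask):
    """Row-wise softmax over the allowed positions only; masked pairs get weight 0."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(score_row, mask_row) if m]
        top = max(allowed)  # subtract the row max for numerical stability
        exps = [math.exp(s - top) if m else 0.0
                for s, m in zip(score_row, mask_row)]
        norm = sum(exps)
        weights.append([e / norm for e in exps])
    return weights

# Hypothetical settings chosen for illustration, not taken from the paper.
seq_len, window = 16, 1
mask = neighbor_mask(seq_len, window)
sparsity = mask_sparsity(mask)

# Toy raw attention scores, used only to exercise the masked softmax.
scores = [[0.1 * (i - j) for j in range(seq_len)] for i in range(seq_len)]
weights = masked_attention_weights(scores, mask)
```

With 16 tokens and a window of 1, the mask keeps 46 of the 256 token pairs, so it is about 82% sparse. Widening the window lowers the sparsity, which is the kind of knob the second experiment in the abstract varies.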
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing · Advanced Neural Network Applications
