Crisp Attention: Regularizing Transformers via Structured Sparsity
Sagar Gandhi, Vishal Gandhi

TL;DR
Introducing structured sparsity into Transformer attention mechanisms can enhance model accuracy and act as an effective regularizer, challenging the belief that sparsity sacrifices performance.
Contribution
This work demonstrates that structured, post-hoc attention sparsity can improve Transformer accuracy, revealing sparsity's potential as a regularizer rather than just a computational shortcut.
Findings
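80% attention sparsity improved validation accuracy by 0.97 percentage points (absolute) over the dense baseline
The DistilBERT model with 80% attention sparsity achieved 91.59% validation accuracy on SST-2
The authors hypothesize that sparsity acts as an implicit regularizer that prevents overfitting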
Abstract
The quadratic computational cost of the self-attention mechanism is a primary challenge in scaling Transformer models. While attention sparsity is widely studied as a technique to improve computational efficiency, it is almost universally assumed to come at the cost of model accuracy. In this paper, we report a surprising counter-example to this common wisdom. By introducing structured, post-hoc sparsity to the attention mechanism of a DistilBERT model during fine-tuning on the SST-2 sentiment analysis task, we find that model accuracy improves significantly. Our model with 80% attention sparsity achieves a validation accuracy of 91.59%, a 0.97% absolute improvement over the dense baseline. We hypothesize that this phenomenon is due to sparsity acting as a powerful implicit regularizer, preventing the model from overfitting by forcing it to make predictions with a more constrained…
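To make the idea of post-hoc attention sparsity concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this summary does not describe their exact sparsity pattern, so a per-row top-k mask stands in for it as an assumption. The function names (`softmax`, `sparsify_row`, `sparse_attention`) and the toy scores are hypothetical. The sketch computes dense attention weights, zeroes the smallest fraction in each row, and renormalizes the survivors.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sparsify_row(weights, sparsity):
    """Zero out the smallest weights in a row and renormalize the rest.

    Assumption: a per-row top-k mask; the paper's actual structured
    pattern may differ.
    """
    n = len(weights)
    # Number of entries to keep; always keep at least one.
    keep = max(1, n - int(round(sparsity * n)))
    # Indices of the `keep` largest weights.
    top = sorted(range(n), key=lambda i: weights[i], reverse=True)[:keep]
    kept = set(top)
    masked = [w if i in kept else 0.0 for i, w in enumerate(weights)]
    total = sum(masked)
    return [w / total for w in masked]

def sparse_attention(scores, sparsity=0.8):
    """Apply softmax, then post-hoc sparsification, to each query row."""
    return [sparsify_row(softmax(row), sparsity) for row in scores]

# Toy example: 2 queries attending over 5 keys (hypothetical scores).
scores = [
    [2.0, 0.5, -1.0, 0.1, 1.5],
    [0.0, 3.0, 0.2, -0.5, 0.1],
]
attn = sparse_attention(scores, sparsity=0.8)
```

With 5 keys and 80% sparsity, each query keeps only its single strongest key. In a real model the masking would be applied per head inside the attention layer, and the mask could be applied to the scores before the softmax instead of to the weights after it.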
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Stochastic Gradient Optimization Techniques · Low-power high-performance VLSI design
