Crisp Attention: Regularizing Transformers via Structured Sparsity

Sagar Gandhi; Vishal Gandhi

arXiv:2508.06016·cs.CL·August 11, 2025

Open Access

TL;DR

Introducing structured sparsity into Transformer attention mechanisms can enhance model accuracy and act as an effective regularizer, challenging the belief that sparsity sacrifices performance.

Contribution

This work demonstrates that structured, post-hoc attention sparsity can improve Transformer accuracy, revealing sparsity's potential as a regularizer rather than just a computational shortcut.

Findings

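01. Attention sparsity improved validation accuracy by 0.97 percentage points (absolute) over the dense baseline

02. With 80% attention sparsity, DistilBERT reached 91.59% validation accuracy on SST-2

03. Sparsity is hypothesized to act as an implicit regularizer that prevents overfitting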

Abstract

The quadratic computational cost of the self-attention mechanism is a primary challenge in scaling Transformer models. While attention sparsity is widely studied as a technique to improve computational efficiency, it is almost universally assumed to come at the cost of model accuracy. In this paper, we report a surprising counter-example to this common wisdom. By introducing structured, post-hoc sparsity to the attention mechanism of a DistilBERT model during fine-tuning on the SST-2 sentiment analysis task, we find that model accuracy improves significantly. Our model with 80% attention sparsity achieves a validation accuracy of 91.59%, a 0.97% absolute improvement over the dense baseline. We hypothesize that this phenomenon is due to sparsity acting as a powerful implicit regularizer, preventing the model from overfitting by forcing it to make predictions with a more constrained…
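The abstract does not spell out how the sparsity is applied, so the following is only a minimal sketch of one plausible reading: compute ordinary softmax attention, then zero all but the largest weights in each query row and renormalize. The top-k rule, the renormalization step, and the names `sparsify_attention` and `sparse_attention` are assumptions made for illustration, not the authors' confirmed method.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sparsify_attention(weights, sparsity=0.8):
    """Keep the largest (1 - sparsity) fraction of weights per row.

    Assumption: a per-row top-k rule followed by renormalization.
    The paper's exact sparsity pattern may differ.
    """
    n_keys = weights.shape[-1]
    keep = max(1, int(round(n_keys * (1.0 - sparsity))))
    # Indices of the `keep` largest weights in each row (unordered).
    top_idx = np.argpartition(weights, -keep, axis=-1)[..., -keep:]
    mask = np.zeros_like(weights, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=-1)
    sparse = np.where(mask, weights, 0.0)
    # Renormalize so each row is again a probability distribution.
    return sparse / sparse.sum(axis=-1, keepdims=True)


def sparse_attention(q, k, v, sparsity=0.8):
    # Standard scaled dot-product attention with the sparsification
    # applied after the softmax (the "post-hoc" step in this sketch).
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    weights = softmax(scores)
    return sparsify_attention(weights, sparsity) @ v
```

With `sparsity=0.8` and 10 keys, each query attends to only 2 keys in this sketch. Note that masking after the softmax still computes the full score matrix, so a version like this illustrates the regularization effect rather than any computational saving.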

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Stochastic Gradient Optimization Techniques · Low-power high-performance VLSI design