Characterizing the Expressivity of Local Attention in Transformers
Jiaoda Li, Ryan Cotterell

TL;DR
This paper gives a formal account of local attention in transformers, showing that it expands the model's expressivity and that combining it with global attention improves language modeling performance.
Contribution
It introduces a formal framework that characterizes local attention in terms of recognizer expressivity, showing that global and local attention are complementary and that combining them benefits language modeling.
Findings
Local attention introduces a second temporal operator, enlarging expressivity.
Hybrid global-local transformers outperform global-only models.
Experiments confirm theoretical predictions with improved language modeling results.
Abstract
The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generating the next token. One common variant of attention is called local attention, which restricts each token to aggregating information from a bounded window of predecessors, reducing the quadratic cost of global attention to linear. Although this restriction is usually motivated by efficiency, it has also been found to improve model quality, a phenomenon that has so far lacked a satisfactory explanation. We provide a formal account of this phenomenon in terms of recognizer expressivity. It has been shown that fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator. We additionally…
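The difference between the two attention patterns described in the abstract can be illustrated with boolean attention masks. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and counting the current token as part of the window is an assumption made here for illustration.

```python
def global_causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.

    Global (causal) attention: every position sees itself and all predecessors.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def local_causal_mask(n, window):
    """Local attention: position i sees only the `window` most recent positions.

    Assumption for this sketch: the window includes position i itself.
    """
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def attended_pairs(mask):
    """Count the (query, key) pairs the mask allows, a proxy for attention cost."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n, window = 5, 2
    for row in local_causal_mask(n, window):
        print("".join("x" if allowed else "." for allowed in row))
    print(attended_pairs(global_causal_mask(n)), attended_pairs(local_causal_mask(n, window)))
```

With a fixed window, the number of allowed pairs grows linearly in the sequence length rather than quadratically, which matches the efficiency motivation given in the abstract.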
