Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns
Brian DuSell, David Chiang

TL;DR
This paper introduces stack attention, a novel attention mechanism for transformers that models hierarchical and context-free language structures without explicit syntactic supervision, improving recognition of complex patterns.
Contribution
The paper proposes stack attention, an attention operator that incorporates stacks into transformers so they can better model hierarchical patterns and context-free languages (CFLs). It comes in two variants: one related to deterministic pushdown automata (PDAs) and one based on nondeterministic PDAs.
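To make the link between stacks and hierarchical patterns concrete, here is a minimal sketch of a classical stack-based recognizer for balanced brackets, a standard example of a context-free language with arbitrary nesting depth. It is an illustration only, not code from the paper; the function name `is_balanced` and the two bracket types are assumptions chosen for the example.

```python
# Illustrative only: a hand-written recognizer for balanced brackets.
# The name is_balanced and the bracket alphabet are assumptions for this sketch.
PAIRS = {")": "(", "]": "["}  # closing bracket -> matching opening bracket


def is_balanced(s: str) -> bool:
    """Return True if s consists only of correctly nested ( ) and [ ]."""
    stack = []
    for ch in s:
        if ch in PAIRS.values():
            # Opening bracket: remember it until its partner arrives.
            stack.append(ch)
        elif ch in PAIRS:
            # Closing bracket: it must match the most recent unmatched opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            # Any other symbol is outside the alphabet of this toy language.
            return False
    # Every opener must have been closed.
    return not stack
```

The stack lets the recognizer handle any nesting depth with a fixed set of rules, which is the capability that standard attention has no built-in mechanism for.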
Findings
Stack attention improves transformers' ability to recognize context-free languages (CFLs).
Transformers with stack attention learn CFLs that standard transformers struggle to learn.
Stack attention is more effective than standard attention at natural language modeling under a constrained parameter budget.
Abstract
Attention, specifically scaled dot-product attention, has proven effective for natural language, but it does not have a mechanism for handling hierarchical patterns of arbitrary nesting depth, which limits its ability to recognize certain syntactic structures. To address this shortcoming, we propose stack attention: an attention operator that incorporates stacks, inspired by their theoretical connections to context-free languages (CFLs). We show that stack attention is analogous to standard attention, but with a latent model of syntax that requires no syntactic supervision. We propose two variants: one related to deterministic pushdown automata (PDAs) and one based on nondeterministic PDAs, which allows transformers to recognize arbitrary CFLs. We show that transformers with stack attention are very effective at learning CFLs that standard transformers struggle on, achieving strong…
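As a rough illustration of how a stack can be made differentiable, the sketch below updates a "soft" stack as a weighted average of the results of pushing, doing nothing, and popping. This follows the general idea of a superposition-style stack and is not the authors' implementation. The function name `superposition_stack_step`, the fixed maximum depth, and the action probabilities being passed in directly (rather than predicted by a network) are all assumptions made for the example.

```python
# Illustrative sketch of a soft ("superposition"-style) stack update.
# Assumptions: fixed maximum depth, row 0 is the top of the stack, and the
# action probabilities are supplied by the caller instead of a learned layer.
def superposition_stack_step(stack, actions, value):
    """Return the new stack after one soft update.

    stack   -- list of rows (each a list of floats); stack[0] is the top
    actions -- (push, noop, pop) weights, expected to sum to 1
    value   -- vector that a push would place on top
    """
    push, noop, pop = actions
    zero = [0.0] * len(value)
    # Result of a hard push: new value on top, bottom row falls off.
    pushed = [list(value)] + stack[:-1]
    # Result of a hard pop: rows shift up, bottom is padded with zeros.
    popped = stack[1:] + [zero]
    # The new stack is the weighted average of the three outcomes.
    return [
        [push * p + noop * s + pop * q for p, s, q in zip(prow, srow, qrow)]
        for prow, srow, qrow in zip(pushed, stack, popped)
    ]


# Usage: with one-hot action weights the soft stack behaves like a real one.
a, b = [1.0, 0.0], [0.0, 1.0]
stack = [[0.0, 0.0] for _ in range(4)]
stack = superposition_stack_step(stack, (1.0, 0.0, 0.0), a)  # push a
stack = superposition_stack_step(stack, (1.0, 0.0, 0.0), b)  # push b
stack = superposition_stack_step(stack, (0.0, 0.0, 1.0), b)  # pop; value is unused
top = stack[0]  # the top is a again
```

With fractional weights, the top row becomes a blend of the candidate tops, so gradients can flow through the choice of action. That is what lets such a stack be trained without syntactic supervision.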
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Topic Modeling · Text Readability and Simplification
