Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns

Brian DuSell; David Chiang

arXiv:2310.01749 · cs.CL · January 25, 2024

TL;DR

This paper introduces stack attention, an attention mechanism for transformers that incorporates stacks to model hierarchical patterns and context-free languages (CFLs) without syntactic supervision.

Contribution

The paper proposes stack attention, an attention operator that incorporates stacks so that transformers can better model hierarchical patterns and context-free languages. It comes in two variants: one related to deterministic pushdown automata (PDAs) and one based on nondeterministic PDAs.

Findings

01

Stack attention enhances transformers' ability to recognize CFLs.

02

Transformers with stack attention outperform standard models on complex hierarchical tasks.

03

Stack attention is more effective than standard attention at natural language modeling under a constrained parameter budget.

Abstract

Attention, specifically scaled dot-product attention, has proven effective for natural language, but it does not have a mechanism for handling hierarchical patterns of arbitrary nesting depth, which limits its ability to recognize certain syntactic structures. To address this shortcoming, we propose stack attention: an attention operator that incorporates stacks, inspired by their theoretical connections to context-free languages (CFLs). We show that stack attention is analogous to standard attention, but with a latent model of syntax that requires no syntactic supervision. We propose two variants: one related to deterministic pushdown automata (PDAs) and one based on nondeterministic PDAs, which allows transformers to recognize arbitrary CFLs. We show that transformers with stack attention are very effective at learning CFLs that standard transformers struggle on, achieving strong…
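To make the idea of an attention-like operator over a stack more concrete, here is a minimal sketch of a "soft" stack update in plain Python. It is an illustration only and not the paper's implementation: the function name `soft_stack_step`, the fixed stack depth, and the three action weights are assumptions, loosely following the general style of differentiable stacks in which the next stack is a weighted blend of the push, no-op, and pop outcomes. The nondeterministic variant mentioned in the abstract is not covered here.

```python
def soft_stack_step(stack, push_vec, a_push, a_noop, a_pop):
    """Return the next stack as a weighted blend of three outcomes.

    stack    : list of vectors (lists of floats); index 0 is the top.
    push_vec : vector that a push would place on top.
    a_push, a_noop, a_pop : non-negative weights, assumed to sum to 1.
    All names are hypothetical and chosen for illustration.
    """
    dim = len(push_vec)
    zero = [0.0] * dim
    # Outcome of a push: new vector on top, deepest element falls off.
    pushed = [push_vec] + stack[:-1]
    # Outcome of a pop: top removed, a zero vector fills the bottom.
    popped = stack[1:] + [zero]
    # Blend the three candidate stacks element by element.
    return [
        [a_push * p + a_noop * s + a_pop * q
         for p, s, q in zip(pushed[i], stack[i], popped[i])]
        for i in range(len(stack))
    ]


# Usage: start from an empty stack of depth 3 holding 2-dimensional vectors.
stack = [[0.0, 0.0] for _ in range(3)]
stack = soft_stack_step(stack, [1.0, 2.0], 1.0, 0.0, 0.0)  # hard push
stack = soft_stack_step(stack, [3.0, 4.0], 1.0, 0.0, 0.0)  # hard push
top_after_pop = soft_stack_step(stack, [0.0, 0.0], 0.0, 0.0, 1.0)[0]
```

With one-hot action weights this behaves like an ordinary stack. With fractional weights the top becomes a mixture of candidates, which is what would let such an update be trained by gradient descent without syntactic supervision.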

Code & Models

Repositories

bdusell/stack-attention (PyTorch, official)
