Overcoming a Theoretical Limitation of Self-Attention

David Chiang; Peter Cholak

arXiv:2202.12172 · cs.LG · February 25, 2022 · 1 citation

TL;DR

This paper examines a limitation of self-attention described by Hahn: on certain regular languages, a transformer's classification decisions become less confident as input strings get longer. It proposes three ways to overcome the limitation, including constructing transformers that classify perfectly and improving length generalization.

Contribution

It constructs transformers that recognize the regular languages PARITY and FIRST with perfect accuracy, and it offers a remedy that improves length generalization.

Findings

01. Constructed transformers that recognize PARITY and FIRST with perfect accuracy.

02. Used layer normalization to bring the cross-entropy of both models arbitrarily close to zero.

03. Provided a simple remedy to improve length generalization in translation.
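
The second finding can be illustrated with a toy calculation. The sketch below is not the paper's construction; it assumes a hypothetical two-component activation `[s, -s]` whose magnitude `s = 1/n` shrinks with the string length `n`, and a hypothetical output scale `c`. Without normalization the logit `c * s` tends to zero and the cross-entropy tends to 1 bit; with layer normalization (here with `eps = 0`, an assumption) the activation is rescaled to roughly `[1, -1]` regardless of `n`, so the cross-entropy depends only on `c`.

```python
import math

def layer_norm(x, eps=0.0):
    """Plain layer normalization: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def cross_entropy_bits(logit):
    """Cross-entropy in bits of a sigmoid classifier on a positive example."""
    return math.log1p(math.exp(-logit)) / math.log(2)

c = 10.0  # hypothetical output scale, chosen for illustration only
for n in (10, 100, 1000):
    s = 1.0 / n                      # assumed signal that shrinks with length
    raw = [s, -s]                    # toy activation, not from the paper
    normed = layer_norm(raw)         # approximately [1.0, -1.0] for every n
    print(n,
          round(cross_entropy_bits(c * raw[0]), 4),     # drifts toward 1 bit
          round(cross_entropy_bits(c * normed[0]), 6))  # stays small
```

Increasing `c` in the normalized case pushes the cross-entropy as close to zero as desired, which is the qualitative behavior the finding describes.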

Abstract

Although transformers are remarkably effective for many tasks, there are some surprisingly easy-looking regular languages that they struggle with. Hahn shows that for languages where acceptance depends on a single input symbol, a transformer's classification decisions become less and less confident (that is, with cross-entropy approaching 1 bit per string) as input strings get longer and longer. We examine this limitation using two languages: PARITY, the language of bit strings with an odd number of 1s, and FIRST, the language of bit strings starting with a 1. We demonstrate three ways of overcoming the limitation suggested by Hahn's lemma. First, we settle an open question by constructing a transformer that recognizes PARITY with perfect accuracy, and similarly for FIRST. Second, we use layer normalization to bring the cross-entropy of both models arbitrarily close to zero. Third, when…
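
As a concrete reading of the definitions in the abstract, the following sketch spells out membership in PARITY and FIRST and the cross-entropy of a single classification decision in bits. The function names are illustrative and do not come from the paper or its repository.

```python
import math

def in_parity(s: str) -> bool:
    """PARITY: bit strings with an odd number of 1s."""
    return s.count("1") % 2 == 1

def in_first(s: str) -> bool:
    """FIRST: bit strings that start with a 1."""
    return s.startswith("1")

def cross_entropy_bits(p_accept: float, label: bool) -> float:
    """Cross-entropy in bits for one string, given the predicted
    acceptance probability and the true label."""
    p = p_accept if label else 1.0 - p_accept
    return -math.log2(p)

# Flipping any single symbol flips PARITY membership.
s = "10110"
flipped = s[:2] + ("0" if s[2] == "1" else "1") + s[3:]
assert in_parity(s) != in_parity(flipped)

# FIRST depends only on the first symbol.
assert in_first("1000") and not in_first("0111")

# A maximally unconfident classifier (probability 0.5) costs 1 bit per string.
print(cross_entropy_bits(0.5, in_parity(s)))  # 1.0
```

The last line shows why "cross-entropy approaching 1 bit per string" means the model's decisions are approaching a coin flip.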

Code & Models

Repositories

ndnlp/parity (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semigroups and Automata Theory

Methods: Layer Normalization