Are Transformers universal approximators of sequence-to-sequence functions?
Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J., Reddi, Sanjiv Kumar

TL;DR
This paper proves that Transformer models are universal approximators for a broad class of sequence-to-sequence functions, revealing their expressive power and the distinct roles of self-attention and feed-forward layers.
Contribution
The paper establishes the universal approximation capabilities of Transformers for continuous sequence-to-sequence functions, with and without positional encodings, and analyzes the roles of different layers.
Findings
Transformers can approximate any continuous permutation equivariant sequence-to-sequence function.
Positional encodings enable approximation of arbitrary sequence-to-sequence functions.
Self-attention layers are crucial for contextual mappings in Transformers.
Abstract
Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed width self-attention layers can compute contextual mappings of the input sequences, playing a key role in…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Software Engineering Research
MethodsLinear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · *Communicated@Fast*How Do I Communicate to Expedia? · Adam · Softmax
