Transformers are Universal Predictors
Sourya Basu, Moulik Choraria, Lav R. Varshney

TL;DR
This paper shows that Transformers have a universal prediction property in an information-theoretic sense and analyzes their performance limits in non-asymptotic data regimes, with a focus on data-efficient training. Experiments on synthetic and real datasets support the theoretical analysis.
Contribution
The paper establishes the universal prediction property of Transformers and analyzes, in non-asymptotic data regimes, their performance limits and the roles of individual architectural components.
Findings
Transformers have a universal prediction property.
Performance limits are characterized in non-asymptotic regimes.
Experimental validation confirms theoretical insights.
Abstract
We find limits to the Transformer architecture for language modeling and show it has a universal prediction property in an information-theoretic sense. We further analyze performance in non-asymptotic data regimes to understand the role of various components of the Transformer architecture, especially in the context of data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.
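As background for the phrase "universal prediction property in an information-theoretic sense": one common formalization says that a sequential predictor is universal over a class of sources if its average log-loss per symbol approaches the entropy rate of whichever source generated the data. The paper's exact definition may differ. The sketch below is not the paper's method and involves no Transformer; it illustrates the idea with the classical Krichevsky-Trofimov (KT) estimator on a memoryless binary source. The function names, the source parameter p=0.3, and the seed are assumptions chosen for illustration.

```python
import math
import random


def kt_log_loss(bits):
    """Cumulative log-loss (in bits) of the KT sequential predictor on a 0/1 sequence."""
    counts = [0, 0]
    total = 0.0
    for b in bits:
        # KT estimate: add 1/2 to each symbol count before normalizing.
        p = (counts[b] + 0.5) / (counts[0] + counts[1] + 1.0)
        total += -math.log2(p)
        counts[b] += 1
    return total


def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) source."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def excess_loss_per_symbol(n, p=0.3, seed=0):
    """Per-symbol log-loss of KT minus the source entropy, on n i.i.d. Bernoulli(p) bits."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    bits = [1 if rng.random() < p else 0 for _ in range(n)]
    return kt_log_loss(bits) / n - binary_entropy(p)


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        print(n, excess_loss_per_symbol(n))
```

The predictor never sees p, yet the gap between its per-symbol loss and the source entropy typically shrinks toward zero as n grows. That is the sense of "universal" in this simple memoryless setting.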
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization
