Transformers are Universal Predictors

Sourya Basu; Moulik Choraria; Lav R. Varshney

arXiv:2307.07843 · cs.LG · July 18, 2023 · 2 cites

TL;DR

This paper shows that the Transformer architecture has a universal prediction property in an information-theoretic sense, identifies limits of the architecture for language modeling, and analyzes the role of its components in non-asymptotic data regimes, particularly for data-efficient training. Experiments on synthetic and real datasets validate the theoretical analysis.

Contribution

The paper establishes the universal prediction property of Transformers, finds limits to the architecture for language modeling, and analyzes the role of its components in non-asymptotic, data-efficient regimes.

Findings

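1. Transformers have a universal prediction property in an information-theoretic sense.
2. Performance is analyzed in non-asymptotic data regimes to understand the role of the architecture's components.
3. Experiments on synthetic and real datasets validate the theoretical analysis.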

Abstract

We find limits to the Transformer architecture for language modeling and show it has a universal prediction property in an information-theoretic sense. We further analyze performance in non-asymptotic data regimes to understand the role of various components of the Transformer architecture, especially in the context of data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization