Universal Approximation Theorem for a Single-Layer Transformer
Esmail Gumaan

TL;DR
This paper establishes a universal approximation theorem for single-layer Transformers, demonstrating their capacity to approximate any continuous sequence-to-sequence function, thereby advancing theoretical understanding of these models.
Contribution
It provides the first formal proof that a single-layer Transformer with self-attention and ReLU can approximate any continuous sequence-to-sequence mapping.
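To make the object of the theorem concrete, here is a minimal NumPy sketch of one single-layer block: single-head self-attention followed by a position-wise ReLU feed-forward network. The page only states "self-attention and ReLU", so the residual connections, the single head, the absence of layer normalization, and all names and dimensions (`Wq`, `Wk`, `Wv`, `W1`, `W2`, `d_k`, `d_ff`) are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(X, Wq, Wk):
    # Scaled dot-product attention weights; each row sums to 1.
    d_k = Wq.shape[1]
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d_k)  # shape (n, n)
    return softmax(scores, axis=-1)

def single_layer_transformer(X, Wq, Wk, Wv, W1, b1, W2, b2):
    # X has shape (n, d): a sequence of n tokens, each of dimension d.
    A = attention_weights(X, Wq, Wk)
    H = X + A @ (X @ Wv)                       # attention + residual (assumed)
    F = np.maximum(H @ W1 + b1, 0.0) @ W2 + b2  # position-wise ReLU network
    return H + F                               # second residual (assumed)

# Hypothetical sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n, d, d_k, d_ff = 5, 8, 4, 16
X = rng.normal(size=(n, d))
Wq = rng.normal(size=(d, d_k))
Wk = rng.normal(size=(d, d_k))
Wv = rng.normal(size=(d, d))
W1 = rng.normal(size=(d, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d))
b2 = np.zeros(d)

Y = single_layer_transformer(X, Wq, Wk, Wv, W1, b1, W2, b2)
```

The block maps an `(n, d)` sequence to an `(n, d)` sequence, which is the kind of sequence-to-sequence map the approximation claim concerns. The weights here are random; the sketch shows the function class, not an approximation procedure.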
Findings
Proves single-layer Transformers are universal approximators.
Provides formal mathematical foundation for Transformer capabilities.
Demonstrates practical implications through case studies.
Abstract
Deep learning employs multi-layer neural networks trained via the backpropagation algorithm. This approach has achieved success across many domains and relies on adaptive gradient methods such as the Adam optimizer. Sequence modeling evolved from recurrent neural networks to attention-based models, culminating in the Transformer architecture. Transformers have achieved state-of-the-art performance in natural language processing (for example, BERT and GPT-3) and have been applied in computer vision and computational biology. However, theoretical understanding of these models remains limited. In this paper, we examine the mathematical foundations of deep learning and Transformers and present a novel theoretical result. We review key concepts from linear algebra, probability, and optimization that underpin deep learning, and we analyze the multi-head self-attention mechanism and the…
Taxonomy
Topics: Digital Filter Design and Implementation · Sensor Technology and Measurement Systems · Non-Destructive Testing Techniques
