Attention Enables Zero Approximation Error
Zhiying Fang, Yidong Ouyang, Ding-Xuan Zhou, Guang Cheng

TL;DR
This paper proves that a single-head self-attention transformer with fixed, untrained encoder blocks can generate any polynomial of the input with no error, giving a theoretical account of the architecture's universality.
Contribution
It introduces a theoretical framework showing that single-head self-attention transformers with fixed, untrained encoder blocks can represent any polynomial of the input with zero error, and are therefore universal approximators.
Findings
Transformer encoder blocks do not need training to generate polynomials.
Single-head self-attention transformers are capable of universal approximation.
The number of encoder blocks equals the degree of the polynomial.
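The link between depth and degree can be illustrated with a minimal sketch. It is not the construction used in the paper: it assumes a scalar input and a softmax-free (bilinear) query-key score, and the names `toy_block`, `monomial_features`, and `evaluate_polynomial` are hypothetical. Each block has fixed weights and multiplies the running state by the input once, so stacking d blocks yields the monomials up to degree d; only the final linear readout carries free parameters (the polynomial coefficients).

```python
import numpy as np


def toy_block(state, x):
    """One fixed (untrained) block using a softmax-free dot-product score.

    Assumption for illustration only: query = current state, key = raw
    input, value = 1. The output is query * key, so the polynomial degree
    of the state rises by exactly one per block.
    """
    w_q, w_k, w_v = 1.0, 1.0, 1.0  # fixed weights, never trained
    score = (w_q * state) * (w_k * x)
    return score * w_v


def monomial_features(x, degree):
    """Stack `degree` toy blocks and collect 1, x, x^2, ..., x^degree."""
    x = np.asarray(x, dtype=float)
    state = np.ones_like(x)
    feats = [state]
    for _ in range(degree):
        state = toy_block(state, x)
        feats.append(state)
    return np.stack(feats, axis=-1)


def evaluate_polynomial(x, coeffs):
    """Linear readout: coeffs[k] multiplies x^k (the only free parameters)."""
    coeffs = np.asarray(coeffs, dtype=float)
    degree = len(coeffs) - 1
    return monomial_features(x, degree) @ coeffs


if __name__ == "__main__":
    xs = np.array([-2.0, 0.5, 3.0])
    # Target: 1 - 2x + 4x^3, a degree-3 polynomial, so three blocks are used.
    print(evaluate_polynomial(xs, [1.0, -2.0, 0.0, 4.0]))
```

In this toy setting the number of blocks equals the degree of the target polynomial, mirroring the finding above; the paper's actual blocks include further adaptations that the truncated abstract does not spell out.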
Abstract
Deep learning models have been widely applied in various aspects of daily life. Many variant models based on deep learning structures have achieved even better performances. Attention-based architectures have become almost ubiquitous in deep learning structures. Especially, the transformer model has now defeated the convolutional neural network in image classification tasks to become the most widely used tool. However, the theoretical properties of attention-based models are seldom considered. In this work, we show that with suitable adaptations, the single-head self-attention transformer with a fixed number of transformer encoder blocks and free parameters is able to generate any desired polynomial of the input with no error. The number of transformer encoder blocks is the same as the degree of the target polynomial. Even more exciting, we find that these transformer encoder blocks in…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Machine Learning and ELM
