Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity
Zhongjie Shi, Wenjing Liao

TL;DR
This paper develops a learning theory for Transformers, showing that they can efficiently approximate Hölder continuous functions on compact Euclidean domains and Riemannian manifolds, and that the resulting generalization error bound is near minimax-optimal.
Contribution
It introduces a local-to-global approximation method for Transformers using softmax partition of unity, providing approximation and generalization guarantees.
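The idea of aggregating local approximations through a softmax partition of unity can be illustrated with a small one-dimensional sketch. This is not the paper's construction: the centers, the sharpness parameter `beta`, the piecewise-constant local approximations, and the target function are all assumptions chosen for illustration. The scores are affine in the input `x`, because the quadratic score `-beta * (x - c)**2` differs from `2 * beta * c * x - beta * c**2` only by a term shared across all centers, which softmax cancels.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def partition_weights(x, centers, beta):
    # Affine-in-x scores (illustrative assumption). They equal
    # -beta * (x - c)**2 up to the shared term -beta * x**2.
    scores = [2.0 * beta * c * x - beta * c * c for c in centers]
    return softmax(scores)

def global_approx(f, x, centers, beta):
    # Local approximation near each center: the constant f(c).
    # Global approximation: softmax-weighted sum of the local constants.
    weights = partition_weights(x, centers, beta)
    return sum(w * f(c) for w, c in zip(weights, centers))

def target(x):
    # Hypothetical target: Hölder continuous with exponent 1/2 on [0, 1].
    return math.sqrt(abs(x - 0.5))

K = 64                                   # number of local pieces (assumed)
centers = [(k + 0.5) / K for k in range(K)]
beta = 4.0 * K * K                       # sharper softmax as the grid refines
grid = [i / 500 for i in range(501)]

max_err = max(abs(global_approx(target, x, centers, beta) - target(x))
              for x in grid)
weights_sum = sum(partition_weights(0.3, centers, beta))
print("max error on grid:", max_err)
print("weights sum at x=0.3:", weights_sum)
```

The weights are non-negative and sum to one at every input, which is what makes them a partition of unity, and increasing `beta` concentrates the weight on the centers nearest to `x`.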
Findings
Transformers with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can uniformly approximate Hölder continuous functions.
Reaching a uniform approximation error of $\varepsilon$ takes $O(\varepsilon^{-d/\alpha})$ parameters; equivalently, the error decays as $O(\text{parameters}^{-\alpha/d})$.
The generalization error bound is near minimax-optimal, scaling as $O(n^{-\frac{2\alpha}{2\alpha+d}} \log n)$.
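To get a feel for the generalization rate $n^{-2\alpha/(2\alpha+d)} \log n$, the following sketch evaluates it for a few sample sizes and dimensions. The constant hidden in the $O(\cdot)$ is not given in this summary, so it is set to 1 here as an assumption; the values are only meaningful relative to one another. The function name `rate` and the chosen values of `n`, `alpha`, and `d` are illustrative.

```python
import math

def rate(n, alpha, d):
    # n^(-2*alpha / (2*alpha + d)) * log(n), with the unknown constant taken as 1.
    exponent = 2.0 * alpha / (2.0 * alpha + d)
    return n ** (-exponent) * math.log(n)

# Compare how fast the bound shrinks for small and large d at alpha = 1.
for d in (2, 10, 50):
    values = [rate(n, 1.0, d) for n in (10**3, 10**4, 10**5, 10**6)]
    print(f"d={d:>2}:", ", ".join(f"{v:.4f}" for v in values))
```

The exponent $2\alpha/(2\alpha+d)$ shrinks as $d$ grows, so the bound decays more slowly in high dimension; for large $d$ and moderate $n$ the logarithmic factor can even dominate.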
Abstract
This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain and $d$-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for Transformers that builds local approximations of the target function and aggregates them into a global approximation via softmax partition of unity. This approach leverages the attention mechanism to achieve spatial localization through affine transformations of the input. The softmax activation plays a crucial role in aggregating local approximations to a global output. From an approximation perspective, we prove that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can achieve a uniform $\varepsilon$-approximation error for $\alpha$-Hölder continuous functions with $\alpha…
