Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity

Zhongjie Shi; Wenjing Liao

arXiv:2605.08811·stat.ML·May 12, 2026

TL;DR

This paper develops a theoretical framework for Transformers, showing they can efficiently approximate functions on Euclidean domains and manifolds, with implications for their generalization capabilities.

Contribution

It introduces a local-to-global approximation method for Transformers using softmax partition of unity, providing approximation and generalization guarantees.

Findings

01

A dense Transformer with only two encoder blocks can uniformly approximate Hölder continuous functions.

02

The approximation error scales with the number of parameters as $\text{O}(\text{parameters}^{-d/\alpha})$.

03

The generalization error bound is near minimax-optimal, scaling as $\text{O}(n^{-\frac{2\alpha}{2\alpha+d}} \log n)$.
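To get a feel for the rate in finding 03, the short Python sketch below evaluates $n^{-2\alpha/(2\alpha+d)} \log n$ for a few sample sizes and dimensions. The function name `rate` and the chosen values of $n$, $\alpha$, and $d$ are illustrative assumptions, and all constants hidden by the O-notation are ignored.

```python
import math


def rate(n, alpha, d):
    """Evaluate n^(-2*alpha / (2*alpha + d)) * log(n), ignoring constants.

    `rate` is a hypothetical helper for illustration only; it is not
    taken from the paper.
    """
    exponent = 2.0 * alpha / (2.0 * alpha + d)
    return n ** (-exponent) * math.log(n)


# Illustrative values only: alpha = 1 and a few dimensions d.
for d in (1, 4, 16):
    values = [rate(n, 1.0, d) for n in (10**3, 10**5, 10**7)]
    print(d, values)
```

For a fixed sample size the expression grows with $d$, which is the usual curse of dimensionality in nonparametric rates of this form.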

Abstract

This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain $[0, 1]^{d}$ and $d$-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for Transformers that builds local approximations of the target function and aggregates them into a global approximation via softmax partition of unity. This approach leverages the attention mechanism to achieve spatial localization through affine transformations of the input. The softmax activation plays a crucial role in aggregating local approximations to a global output. From an approximation perspective, we prove that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can achieve a uniform $ε$-approximation error for $α$-Hölder continuous functions with $\alpha…
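The local-to-global idea can be illustrated with a small one-dimensional Python sketch. It is not the authors' construction: the grid of centers, the sharpness `beta`, the piecewise-constant local approximations, and every function name below are assumptions chosen for illustration. The scores are affine in the input $x$, since $-\beta (x - c)^2$ differs from $\beta (2cx - c^2)$ only by a term that is the same for every center and therefore cancels inside the softmax.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def partition_weights(x, centers, beta):
    """Softmax weights that are nonnegative, sum to 1, and peak near x.

    Scores are affine in x: beta * (2*c*x - c*c) equals -beta*(x - c)**2
    plus beta*x**2, and the latter is shared by all centers.
    """
    scores = [beta * (2.0 * c * x - c * c) for c in centers]
    return softmax(scores)


def global_approx(f, x, centers, beta):
    """Aggregate local constant approximations f(c) with softmax weights."""
    weights = partition_weights(x, centers, beta)
    return sum(w * f(c) for w, c in zip(weights, centers))


def target(x):
    """A 1/2-Hölder continuous test function on [0, 1]."""
    return math.sqrt(abs(x - 0.5))


K = 64                                       # number of local pieces (assumption)
centers = [(j + 0.5) / K for j in range(K)]  # uniform grid on [0, 1]
beta = 4.0 * K * K                           # sharpness, chosen by hand

grid = [i / 1000 for i in range(1001)]
max_err = max(abs(global_approx(target, x, centers, beta) - target(x)) for x in grid)
weights_sum = sum(partition_weights(0.3, centers, beta))
print(max_err, weights_sum)
```

Increasing `K` (with `beta` scaled as above) shrinks the neighborhoods over which each local value is used, so the uniform error decreases for a Hölder target; the paper's guarantees concern an actual Transformer architecture rather than this toy aggregation.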
