Approximation Error Upper and Lower Bounds for Hölder Class with Transformers
Xin He, Yuling Jiao, Xiliang Lu, Jerry Zhijian Yang

TL;DR
This paper establishes precise upper and lower bounds on the approximation error of Transformers for Hölder functions, revealing their theoretical expressive power and limitations.
Contribution
It derives a new upper bound and gives the first rigorous proof of a lower bound on the number of Transformer blocks needed for approximation, and extends the results to regression tasks.
Findings
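Transformers can approximate any bounded Hölder function using a number of blocks that depends on the target accuracy, the input dimension, and the smoothness of the function.
Lower bounds, derived from a VC-dimension upper bound, show that Transformers require at least a certain number of blocks to reach a given accuracy.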
Results extend to regression, demonstrating practical effectiveness.
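For readers unfamiliar with the function class named in these findings, one common textbook definition of a Hölder class is sketched below. The notation is an assumption made for illustration: the domain $[0,1]^d$, the symbols $d$, $\beta$, $K$, and the split $\beta = s + r$ are not taken from this summary, and the paper's own definition may differ in its details.

```latex
% Assumed notation (not from the summary): d is the input dimension,
% \beta = s + r is the smoothness with s a nonnegative integer and r in (0, 1],
% and K > 0 bounds both the derivatives and the Hölder seminorm.
\[
\mathcal{H}^{\beta}\bigl([0,1]^d, K\bigr)
= \Bigl\{ f : [0,1]^d \to \mathbb{R} \;:\;
\max_{\|\alpha\|_1 \le s} \|\partial^{\alpha} f\|_{\infty} \le K,\;
\max_{\|\alpha\|_1 = s} \sup_{x \ne y}
\frac{|\partial^{\alpha} f(x) - \partial^{\alpha} f(y)|}{\|x - y\|_2^{\,r}} \le K
\Bigr\}
\]
```

For $\beta \le 1$ this reduces to $s = 0$ and $r = \beta$, so the condition says that $f$ is bounded by $K$ and satisfies $|f(x) - f(y)| \le K \|x - y\|_2^{\beta}$.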
Abstract
We explore the expressive power of Transformers by establishing precise approximation error upper and lower bounds for the Hölder class. Specifically, a new approximation upper bound is derived for the standard Transformer architecture equipped with Softmax operators, ReLU activation functions, and residual connections. We prove that a Transformer network composed of at most blocks can approximate any bounded Hölder function with -dimensional input and smoothness under any accuracy . In the case of approximation lower bounds, leveraging the VC-dimension upper bound, we are the first to rigorously prove that Transformers require at least blocks to achieve the approximation accuracy. As a final step, we extend the derived results for…
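To make the architecture named in the abstract concrete, the following is a minimal sketch of one block with Softmax attention, a ReLU feed-forward layer, and residual connections, stacked several times. It is an illustration under stated assumptions, not the construction used in the paper: single-head attention, no bias terms, no layer normalization, and every name here (`transformer_block`, `random_block`, `stack`, and the weight names) is hypothetical.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def add(A, B):
    """Elementwise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def transpose(A):
    return [list(col) for col in zip(*A)]


def softmax_rows(S):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in S:
        m = max(row)
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        out.append([v / z for v in e])
    return out


def relu(A):
    return [[max(a, 0.0) for a in row] for row in A]


def transformer_block(X, Wq, Wk, Wv, Wo, W1, W2):
    """One block (assumed form): X is (n_tokens, d_model); output has the same shape."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    scale = 1.0 / math.sqrt(len(Wq[0]))  # 1 / sqrt(d_head)
    scores = [[s * scale for s in row] for row in matmul(Q, transpose(K))]
    attn = softmax_rows(scores)
    H = add(X, matmul(matmul(attn, V), Wo))         # attention + residual
    return add(H, matmul(relu(matmul(H, W1)), W2))  # ReLU feed-forward + residual


def random_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def random_block(rng, d_model, d_head, d_ff):
    """Random weights for one block; the sizes are illustrative choices."""
    return (
        random_matrix(rng, d_model, d_head),  # Wq
        random_matrix(rng, d_model, d_head),  # Wk
        random_matrix(rng, d_model, d_head),  # Wv
        random_matrix(rng, d_head, d_model),  # Wo
        random_matrix(rng, d_model, d_ff),    # W1
        random_matrix(rng, d_ff, d_model),    # W2
    )


def stack(X, blocks):
    """Apply the blocks in sequence; len(blocks) is the quantity the bounds count."""
    for params in blocks:
        X = transformer_block(X, *params)
    return X
```

In this sketch the number of entries in `blocks` plays the role of the block count that the upper and lower bounds constrain; the widths `d_head` and `d_ff` are free parameters chosen only for the example.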
