Standard Transformers Achieve the Minimax Rate in Nonparametric Regression with $C^{s,\lambda}$ Targets
Yanming Lai, Defeng Sun

TL;DR
This paper proves that standard Transformer models can approximate Hölder functions with arbitrary precision and achieve the minimax optimal rate in nonparametric regression, providing theoretical justification for their effectiveness.
Contribution
It establishes the approximation capabilities of standard Transformers for Hölder functions and demonstrates that they attain the minimax rate in nonparametric regression, with a detailed structural characterization.
Findings
Transformers can approximate Hölder functions under the $L^t$ distance.
Transformers achieve the minimax optimal rate in nonparametric regression.
The paper derives bounds on the Lipschitz constants and memorization capacity of Transformers.
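For orientation, the findings above can be read against the textbook definition of the Hölder class and the classical minimax rate for nonparametric regression. The block below is a sketch in standard notation, not the paper's exact statement: the domain $\Omega$, the input dimension $D$, the sample size $N$, and the choice of squared $L^2$ risk are assumptions introduced here, and the paper's normalization, constants, and any logarithmic factors may differ.

```latex
% Assumed notation (not taken from the paper): \Omega \subset \mathbb{R}^D is the
% input domain, N is the number of samples, s \in \mathbb{N}_{\ge 0}, 0 < \lambda \le 1.

% Hölder norm: bounded derivatives up to order s, with \lambda-Hölder
% continuous derivatives of order exactly s.
\[
  \|f\|_{C^{s,\lambda}(\Omega)}
  = \max_{|\alpha| \le s} \sup_{x \in \Omega} |\partial^{\alpha} f(x)|
  + \max_{|\alpha| = s} \sup_{\substack{x, y \in \Omega \\ x \ne y}}
    \frac{|\partial^{\alpha} f(x) - \partial^{\alpha} f(y)|}{\|x - y\|^{\lambda}} .
\]

% Classical minimax rate over a Hölder ball of smoothness \beta = s + \lambda,
% under squared L^2 risk; the infimum runs over all estimators \hat{f}.
\[
  \inf_{\hat{f}} \sup_{\|f\|_{C^{s,\lambda}(\Omega)} \le 1}
  \mathbb{E}\, \|\hat{f} - f\|_{L^2(\Omega)}^{2}
  \asymp N^{-\frac{2(s+\lambda)}{2(s+\lambda) + D}} .
\]
```

An estimator "achieves the minimax rate" when its worst-case risk over the class matches this exponent, typically up to logarithmic factors in $N$.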
Abstract
The tremendous success of Transformer models in fields such as large language models and computer vision necessitates a rigorous theoretical investigation. To the best of our knowledge, this paper is the first work proving that standard Transformers can approximate Hölder functions under the $L^t$ distance with arbitrary precision. Building upon this approximation result, we demonstrate that standard Transformers achieve the minimax optimal rate in nonparametric regression for Hölder target functions. It is worth mentioning that, by introducing two metrics, the size tuple and the dimension vector, we provide a fine-grained characterization of Transformer structures, which facilitates future research on the generalization and optimization errors of Transformers with…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
