Transformers for Learning on Noisy and Task-Level Manifolds: Approximation and Generalization Insights
Zhaiming Shen, Alex Havrilla, Rongjie Lai, Alexander Cloninger, Wenjing Liao

TL;DR
This paper provides a theoretical analysis of transformers' ability to learn from noisy data near low-dimensional manifolds, revealing how they leverage intrinsic data structures for regression tasks.
Contribution
It establishes a theoretical foundation for understanding transformers' performance on noisy, manifold-structured data, highlighting their ability to exploit low-dimensional structures.
Findings
Transformers' approximation errors depend on the intrinsic dimension of the task manifold.
Transformers can effectively leverage low-complexity structures despite high-dimensional noise.
The paper introduces a novel proof technique constructing basic arithmetic operations with transformers.
Abstract
Transformers serve as the foundational architecture for large language and video generation models, such as GPT, BERT, SORA and their successors. Empirical studies have demonstrated that real-world data and learning tasks exhibit low-dimensional structures, along with some noise or measurement error. The performance of transformers tends to depend on the intrinsic dimension of the data/tasks, though a theoretical understanding of this dependence remains largely undeveloped for transformers. This work establishes a theoretical foundation by analyzing the performance of transformers for regression tasks involving noisy input data near a manifold. Specifically, the input data lie in a tubular neighborhood of a manifold, while the ground truth function depends on the projection of the noisy data onto this manifold, referred to as the task-level manifold. We prove approximation and generalization errors which…
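The data model described in the abstract can be illustrated with a toy example. The sketch below is not taken from the paper: it assumes the manifold is the unit circle embedded in the first two coordinates of a 10-dimensional space, and the names `sample_point`, `project_to_circle`, `target`, the tube radius 0.3, and the function sin(3θ) are all hypothetical choices made for illustration. It shows inputs drawn from a tubular neighborhood whose regression target depends only on their projection onto the manifold.

```python
import math
import random

TUBE_RADIUS = 0.3  # assumed value; must stay below the unit circle's reach (1.0)


def sample_point(rng, ambient_dim=10, tube_radius=TUBE_RADIUS):
    """Return (noisy, clean): a point on the unit circle and a copy
    perturbed along a direction normal to the circle."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    pad = [0.0] * (ambient_dim - 2)
    clean = [math.cos(theta), math.sin(theta)] + pad
    tangent = [-math.sin(theta), math.cos(theta)] + pad

    # Draw a random direction and remove its tangential component,
    # so the perturbation lies in the normal space at `clean`.
    direction = [rng.gauss(0.0, 1.0) for _ in range(ambient_dim)]
    along = sum(d * t for d, t in zip(direction, tangent))
    normal = [d - along * t for d, t in zip(direction, tangent)]
    length = math.sqrt(sum(v * v for v in normal))

    offset = tube_radius * rng.random()  # distance from the manifold
    noisy = [c + offset * v / length for c, v in zip(clean, normal)]
    return noisy, clean


def project_to_circle(x):
    """Nearest point on the embedded unit circle: normalize the first
    two coordinates and zero out the rest."""
    planar_norm = math.hypot(x[0], x[1])
    return [x[0] / planar_norm, x[1] / planar_norm] + [0.0] * (len(x) - 2)


def target(x):
    """Toy ground truth that depends on x only through its projection."""
    p = project_to_circle(x)
    return math.sin(3.0 * math.atan2(p[1], p[0]))


if __name__ == "__main__":
    rng = random.Random(0)
    noisy, clean = sample_point(rng)
    print(target(noisy), target(clean))  # the two values agree up to rounding
```

Because the tube radius is smaller than the reach of the circle, the projection is unique and the noisy input and its clean counterpart share the same target value. In this toy case the regression problem has intrinsic dimension 1 even though the inputs live in 10 dimensions.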
Taxonomy
Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Gaussian Processes and Bayesian Inference
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Byte Pair Encoding · Attention Dropout · Softmax · Residual Connection · WordPiece · Linear Layer · Multi-Head Attention
