Ordinary Least Squares is a Special Case of Transformer
Xiaojun Tan, Yuchen Zhao

TL;DR
This paper proves that Ordinary Least Squares regression is a special case of the single-layer Linear Transformer, giving the attention mechanism a foundation in classical statistics.
Contribution
It demonstrates that OLS is a special case of the Linear Transformer, providing a rigorous algebraic proof and connecting Transformers to classical statistical inference.
Findings
Under a specific parameter setting, linear attention computes the OLS closed-form projection in a single forward pass, without iteration.
Identifies a decoupled slow and fast memory mechanism in Transformers.
Discusses the transition from linear prototypes to standard Transformers.
Abstract
The statistical essence of the Transformer architecture has long remained elusive: Is it a universal approximator, or a neural network version of known computational algorithms? Through rigorous algebraic proof, we show that the latter better describes the Transformer's basic nature: Ordinary Least Squares (OLS) is a special case of the single-layer Linear Transformer. Using the spectral decomposition of the empirical covariance matrix, we construct a specific parameter setting where the attention mechanism's forward pass becomes mathematically equivalent to the OLS closed-form projection. This means attention can solve the regression problem in a single forward pass rather than by iteration. Building upon this prototypical case, we further uncover a decoupled slow and fast memory mechanism within Transformers. Finally, the evolution from our established linear prototype to standard Transformers is discussed.…
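The equivalence stated in the abstract can be checked numerically with a minimal sketch. This is an illustration under stated assumptions, not the authors' construction: it assumes softmax-free (linear) attention, a full-rank design matrix, scalar values v_i = y_i, and query/key weights W_Q = W_K = Λ^{-1/2} Uᵀ taken from the eigendecomposition XᵀX = U Λ Uᵀ. The function names `ols_predict` and `linear_attention_predict` are hypothetical.

```python
import numpy as np


def ols_predict(X, y, x_query):
    """Closed-form OLS prediction x_q^T (X^T X)^{-1} X^T y."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return x_query @ beta


def linear_attention_predict(X, y, x_query):
    """Single linear-attention pass: sum_i (q . k_i) * v_i.

    Assumed parameter setting (illustrative, not taken from the paper):
    W_Q = W_K = Lambda^{-1/2} U^T, where X^T X = U Lambda U^T,
    and the values are the scalar targets y_i.
    """
    S = X.T @ X                     # unnormalized empirical covariance
    eigvals, U = np.linalg.eigh(S)  # spectral decomposition (S is symmetric)
    W = (U / np.sqrt(eigvals)).T    # Lambda^{-1/2} U^T, requires full rank
    q = W @ x_query                 # query vector
    K = X @ W.T                     # one key per training row
    scores = K @ q                  # q . k_i = x_i^T S^{-1} x_q, no softmax
    return scores @ y               # weighted sum of values


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = rng.normal(size=50)
    x_query = rng.normal(size=3)
    print(ols_predict(X, y, x_query))
    print(linear_attention_predict(X, y, x_query))
```

Because q · k_i = x_qᵀ (XᵀX)^{-1} x_i, summing the scores against the targets reproduces x_qᵀ (XᵀX)^{-1} Xᵀ y, so the two printed values should agree up to floating-point error.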
