YuriiFormer: A Suite of Nesterov-Accelerated Transformers
Aleksandr Zimin, Yury Polyanskiy, Philippe Rigollet

TL;DR
YuriiFormer introduces a variational framework interpreting transformer layers as optimization steps, leading to a Nesterov-accelerated architecture that improves performance on language modeling tasks.
Contribution
The paper presents a novel optimization-inspired perspective on transformers and develops a Nesterov-accelerated variant that enhances language modeling performance.
Findings
Consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText.
Demonstrates practical benefits of optimization-theoretic design.
Introduces a principled architectural framework for transformers.
Abstract
We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie-Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.
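The optimization view in the abstract can be illustrated with a small numerical sketch. This is not the authors' architecture: the two energies below are toy quadratics chosen only because their gradients are easy to write down, and the names `grad_interaction`, `make_grad_potential`, `gd_layer`, `nesterov_layer`, and `run`, as well as the step size `eta` and momentum `mu`, are assumptions made for illustration. The sketch treats one "layer" as a Lie-Trotter split step (an interaction-energy step followed by a potential-energy step) and compares plain gradient descent with one common Nesterov-style look-ahead momentum update that calls the same two gradient oracles.

```python
import numpy as np

def grad_interaction(x):
    # Toy stand-in for the attention oracle (assumption, not real attention):
    # gradient of E_int(X) = (1/(4n)) * sum_ij ||x_i - x_j||^2, which is X - mean(X).
    return x - x.mean(axis=0, keepdims=True)

def make_grad_potential(center):
    # Toy stand-in for the MLP oracle (assumption):
    # gradient of E_pot(X) = (1/2) * sum_i ||x_i - center||^2, which is X - center.
    return lambda x: x - center

def gd_layer(x, oracles, eta):
    # Lie-Trotter splitting: one plain gradient step per energy, applied in sequence.
    for grad in oracles:
        x = x - eta * grad(x)
    return x

def nesterov_layer(x, v, oracles, eta, mu):
    # Nesterov-style variant: evaluate each oracle at a look-ahead point,
    # update the velocity, then move the tokens. Same oracles as gd_layer.
    for grad in oracles:
        lookahead = x + mu * v
        v = mu * v - eta * grad(lookahead)
        x = x + v
    return x, v

def run(n_layers=12, eta=0.1, mu=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x0 = rng.normal(size=(8, 4))      # 8 tokens with 4-dimensional embeddings
    center = np.ones(4)               # minimizer of the composite toy objective
    oracles = [grad_interaction, make_grad_potential(center)]

    x_gd = x0.copy()
    x_nag, v = x0.copy(), np.zeros_like(x0)
    for _ in range(n_layers):
        x_gd = gd_layer(x_gd, oracles, eta)
        x_nag, v = nesterov_layer(x_nag, v, oracles, eta, mu)

    def err(x):
        # Distance of all tokens from the minimizer.
        return float(np.linalg.norm(x - center))

    return err(x0), err(x_gd), err(x_nag)

if __name__ == "__main__":
    err0, err_gd, err_nag = run()
    print(f"initial: {err0:.4f}  split GD: {err_gd:.4f}  Nesterov-style: {err_nag:.4f}")
```

On this toy quadratic, with the chosen `eta` and `mu`, the momentum variant ends closer to the minimizer after the same number of layers. That is a property of the toy problem only and says nothing about language-modeling performance.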
Taxonomy
Topics: Quantum Computing Algorithms and Architecture · Neural Networks and Reservoir Computing · Stochastic Gradient Optimization Techniques
