YuriiFormer: A Suite of Nesterov-Accelerated Transformers

Aleksandr Zimin; Yury Polyanskiy; Philippe Rigollet

arXiv:2601.23236 · cs.LG · March 6, 2026

Open Access

TL;DR

YuriiFormer introduces a variational framework that interprets transformer layers as steps of an optimization algorithm acting on token embeddings, and uses it to derive a Nesterov-accelerated architecture that outperforms a nanoGPT baseline on TinyStories and OpenWebText.

Contribution

The paper presents an optimization-theoretic view of transformers, in which standard GPT-style models correspond to vanilla gradient descent on a composite energy, and develops a Nesterov-style accelerated variant that reuses the same attention and MLP oracles.

Findings

01

Consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText.

02

Demonstrates practical benefits of optimization-theoretic design.

03

Introduces a principled architectural framework for transformers.

Abstract

We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie-Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.
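The update scheme described in the abstract can be illustrated with a toy sketch. This is not the paper's architecture: the scalar "tokens", the stand-in oracles `attn_oracle` and `mlp_oracle`, the function names `gd_block` and `nesterov_block`, and the fixed `step` and `mu` values are all assumptions made for illustration. It only shows the general shape of a Lie-Trotter split gradient step (apply one oracle, then the other) and a textbook Nesterov-style variant that queries the same oracles at a look-ahead point.

```python
import math

def attn_oracle(x, beta=1.0):
    """Toy attention-like direction (assumption): each scalar token moves
    toward a softmax-weighted average of all tokens."""
    out = []
    for xi in x:
        w = [math.exp(beta * xi * xj) for xj in x]
        z = sum(w)
        avg = sum(wj * xj for wj, xj in zip(w, x)) / z
        out.append(avg - xi)
    return out

def mlp_oracle(x):
    """Toy per-token direction (assumption): negative gradient of the
    potential V(x) = sum_i log cosh(x_i), i.e. -tanh(x_i)."""
    return [-math.tanh(xi) for xi in x]

def gd_block(x, oracles, step=0.1):
    """Plain gradient step with Lie-Trotter splitting: apply the oracles
    one after another, each as its own residual update."""
    for g in oracles:
        d = g(x)
        x = [xi + step * di for xi, di in zip(x, d)]
    return x

def nesterov_block(x, v, oracles, step=0.1, mu=0.9):
    """Nesterov-style step using the same oracles: evaluate each oracle at
    a look-ahead point, then update the velocity v and the state x."""
    for g in oracles:
        y = [xi + mu * vi for xi, vi in zip(x, v)]       # look-ahead point
        d = g(y)                                         # same oracle, new query point
        v = [mu * vi + step * di for vi, di in zip(v, d)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x, v

if __name__ == "__main__":
    tokens = [1.0, -0.5, 2.0]
    oracles = [attn_oracle, mlp_oracle]
    print(gd_block(tokens, oracles))
    print(nesterov_block(tokens, [0.0] * len(tokens), oracles))
```

In this sketch, setting `mu=0` makes `nesterov_block` coincide with `gd_block`, which mirrors the abstract's framing of the standard transformer as the unaccelerated case. How the actual model parameterizes momentum and normalization is not specified in the text above.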

Taxonomy

Topics: Quantum Computing Algorithms and Architecture · Neural Networks and Reservoir Computing · Stochastic Gradient Optimization Techniques