An Optimal Control Approach To Transformer Training

Kağan Akman; Naci Saldı; Serdar Yüksel

arXiv:2603.09571·cs.LG·March 11, 2026

Open Access

TL;DR

This paper introduces an optimal control framework for Transformer training that models the architecture as a controlled particle system, enabling globally optimal, robust training without gradient-based methods.

Contribution

It develops a novel optimal control approach using lifted MDPs and quantization, providing a globally optimal, stable, and gradient-free training method for Transformers.

Findings

01. Existence of globally optimal policies under mild conditions.

02. Proposed triply quantized training achieves near-optimality.

03. Model exhibits stability and empirical consistency.

Abstract

In this paper, we develop a rigorous optimal control-theoretic approach to Transformer training that respects key structural constraints such as (i) realized-input-independence during execution, (ii) the ensemble control nature of the problem, and (iii) positional dependence. We model the Transformer architecture as a discrete-time controlled particle system with shared actions, exhibiting noise-free McKean-Vlasov dynamics. While the resulting dynamics is not Markovian, we show that lifting it to probability measures produces a fully-observed Markov decision process (MDP). Positional encodings are incorporated into the state space to preserve the sequence order under lifting. Using the dynamic programming principle, we establish the existence of globally optimal policies under mild assumptions of compactness. We further prove that closed-loop policies in the lifted MDP are equivalent to an…
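To make the particle-system picture concrete, here is a minimal sketch of a noise-free, discrete-time system in which every particle (token) is moved by the same shared action and interacts with the others only through a softmax-weighted average. This is an illustration under stated assumptions, not the paper's construction: the scalar particles, the specific softmax interaction, and the names `attention_step`, `rollout`, and `same_empirical_measure` are all hypothetical, and positional encodings are omitted for brevity. The sketch shows that reordering the particles leaves the resulting empirical measure unchanged, which is the kind of property that makes the distribution of particles a natural state for a lifted problem.

```python
import math
from typing import List, Sequence


def attention_step(particles: Sequence[float], action: float) -> List[float]:
    """One noise-free step: each particle moves by a softmax-weighted
    average of all particles, scaled by a single shared action.

    Assumption: this interaction is a toy stand-in for self-attention,
    not the dynamics defined in the paper.
    """
    updated = []
    for x_i in particles:
        scores = [x_i * x_j for x_j in particles]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mean_field = sum(w * x_j for w, x_j in zip(weights, particles)) / total
        updated.append(x_i + action * mean_field)
    return updated


def rollout(particles: Sequence[float], actions: Sequence[float]) -> List[float]:
    """Apply a fixed action sequence; the actions do not depend on the
    realized particles, and the same action is shared by all of them."""
    state = list(particles)
    for action in actions:
        state = attention_step(state, action)
    return state


def same_empirical_measure(a: Sequence[float], b: Sequence[float],
                           tol: float = 1e-9) -> bool:
    """Compare two particle clouds as unordered collections of atoms."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=tol) for x, y in zip(sorted(a), sorted(b))
    )


if __name__ == "__main__":
    actions = [0.1, -0.2, 0.05]          # hypothetical shared actions
    tokens = [0.5, -1.0, 2.0]            # hypothetical scalar particles
    shuffled = [2.0, 0.5, -1.0]          # same particles, different order
    print(same_empirical_measure(rollout(tokens, actions),
                                 rollout(shuffled, actions)))
```

Because the update treats the other particles only through a weighted average, the order in which they are listed does not matter here. In the paper's setting, sequence order does matter, which is why the abstract describes adding positional encodings to the state before lifting.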

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics · Robot Manipulation and Learning