An Optimal Control Approach To Transformer Training
Kağan Akman, Naci Saldı, Serdar Yüksel

TL;DR
This paper introduces an optimal control framework for Transformer training that models the architecture as a controlled particle system, enabling globally optimal, robust training without gradient-based methods.
Contribution
It develops a novel optimal control approach using lifted MDPs and quantization, providing a globally optimal, stable, and gradient-free training method for Transformers.
Findings
Existence of globally optimal policies under mild conditions.
Proposed triply quantized training achieves near-optimality.
Model exhibits stability and empirical consistency.
Abstract
In this paper, we develop a rigorous optimal control-theoretic approach to Transformer training that respects key structural constraints such as (i) realized-input-independence during execution, (ii) the ensemble control nature of the problem, and (iii) positional dependence. We model the Transformer architecture as a discrete-time controlled particle system with shared actions, exhibiting noise-free McKean-Vlasov dynamics. While the resulting dynamics is not Markovian, we show that lifting it to probability measures produces a fully-observed Markov decision process (MDP). Positional encodings are incorporated into the state space to preserve the sequence order under lifting. Using the dynamic programming principle, we establish the existence of globally optimal policies under mild assumptions of compactness. We further prove that closed-loop policies in the lifted is equivalent to an…
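To make the modeling idea in the abstract concrete, here is a minimal Python sketch, not the authors' construction. It treats tokens as particles that are all updated by one shared action, and it forms the empirical measure that serves as the lifted state. The function names (`shared_action_step`, `empirical_measure`), the scalar token states, and the two-parameter action `(qk_scale, v_scale)` are illustrative assumptions. The position index is carried inside each particle's state to mirror the abstract's remark about positional encodings.

```python
import math
from collections import Counter


def shared_action_step(particles, action):
    """One noise-free update in which every particle uses the same action.

    particles: list of (position, value) pairs; position is part of the state.
    action: hypothetical pair (qk_scale, v_scale) shared by all particles.
    Each update depends on the whole population through softmax attention
    weights, which gives the mean-field (McKean-Vlasov-style) interaction.
    """
    qk_scale, v_scale = action
    values = [x for _, x in particles]
    updated = []
    for pos, x in particles:
        scores = [qk_scale * x * y for y in values]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        attended = sum(w * v_scale * y for w, y in zip(weights, values)) / total
        updated.append((pos, x + attended))  # residual-style update
    return updated


def empirical_measure(particles, ndigits=6):
    """Lifted state: the empirical distribution over (position, value) pairs."""
    counts = Counter((pos, round(x, ndigits)) for pos, x in particles)
    n = len(particles)
    return {state: c / n for state, c in counts.items()}


# Illustrative usage with made-up numbers.
tokens = [(0, 0.5), (1, -1.0), (2, 2.0)]
action = (0.3, 0.1)
next_tokens = shared_action_step(tokens, action)
lifted_state = empirical_measure(next_tokens)
```

In this toy version the next empirical measure is determined by the current one and the shared action alone, which is the property that lets the lifted process be treated as a fully observed MDP. The order in which the particles are listed does not matter, because each particle carries its own position.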
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics · Robot Manipulation and Learning
