Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency

Kelvin Kan; Xingjian Li; Benjamin J. Zhang; Tuhin Sahai; Stanley Osher; Markos A. Katsoulakis

arXiv:2505.13499·cs.LG·October 27, 2025

TL;DR

This paper introduces an optimal control theory framework for Transformers that enhances their performance, robustness, and efficiency, with theoretical guarantees and seamless integration into existing models.

Contribution

This is the first work to apply optimal control theory to both the training and the architecture of Transformers, providing a systematic, theory-driven approach.

Findings

01. 46% reduction in test loss on nanoGPT with 42% fewer parameters

02. 9.3% reduction in test loss on GPT-2

03. Improved generalization and robustness of Transformer models

Abstract

We study Transformers through the perspective of optimal control theory, using tools from continuous-time formulations to derive actionable insights into training and architecture design. This framework improves the performance of existing Transformer models while providing desirable theoretical guarantees, including generalization and robustness. Our framework is designed to be plug-and-play, enabling seamless integration with established Transformer models and requiring only slight changes to the implementation. We conduct seven extensive experiments on tasks motivated by text generation, sentiment analysis, image classification, and point cloud classification. Experimental results show that the framework improves the test performance of the baselines, while being more parameter-efficient. On character-level text generation with nanoGPT, our framework achieves a 46% reduction in final…

Videos

Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency · SlidesLive

Taxonomy

Topics: Magnetic Properties and Applications · Advanced DC-DC Converters · Power Transformer Diagnostics and Insulation

Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Attention Dropout · Linear Layer · Byte Pair Encoding · Weight Decay · Label Smoothing · Dropout