Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency
Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Markos A. Katsoulakis

TL;DR
This paper introduces an optimal control theory framework for Transformers that enhances their performance, robustness, and efficiency, with theoretical guarantees and seamless integration into existing models.
Contribution
It is the first work to apply optimal control theory to both the training and the architecture of Transformers, providing a systematic, theory-driven approach.
Findings
46% reduction in final test loss on character-level text generation with nanoGPT, with 42% fewer parameters
9.3% reduction in test loss on GPT-2
Improved generalization and robustness of Transformer models
Abstract
We study Transformers through the perspective of optimal control theory, using tools from continuous-time formulations to derive actionable insights into training and architecture design. This framework improves the performance of existing Transformer models while providing desirable theoretical guarantees, including generalization and robustness. Our framework is designed to be plug-and-play, enabling seamless integration with established Transformer models and requiring only slight changes to the implementation. We conduct seven extensive experiments on tasks motivated by text generation, sentiment analysis, image classification, and point cloud classification. Experimental results show that the framework improves the test performance of the baselines, while being more parameter-efficient. On character-level text generation with nanoGPT, our framework achieves a 46% reduction in final…
Taxonomy
Topics: Magnetic Properties and Applications · Advanced DC-DC Converters · Power Transformer Diagnostics and Insulation
Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Attention Dropout · Linear Layer · Byte Pair Encoding · Weight Decay · Label Smoothing · Dropout
