Transformer-like Inference from Optimal Control

Aditya Kudre; Heng-Sheng Chang; Prashant G. Mehta

arXiv:2605.15608·cs.LG·May 18, 2026

TL;DR

This paper derives transformer-like inference architectures from optimal control principles, providing a theoretical foundation and explicit algorithms that mirror transformer layer structures.

Contribution

The paper connects transformers to optimal control theory: it reformulates next-token prediction as an optimal control problem whose solution, the dual filter, is an explicit inference algorithm with a layer structure similar to that of a decoder-only transformer.

Findings

01. Optimal control-based inference algorithms mirror transformer layer structures.

02. Transformers implicitly exploit non-Markovian structure when the embedding dimension is limited.

03. Numerical experiments compare the optimal control solution with attention weights from a trained transformer.

Abstract

Decoder-only transformers compute the conditional probability of the next token from a sequence of past observations. This paper derives, from first principles, inference architectures that solve the same prediction problem and, in doing so, recovers transformer-like layer operations as a consequence of optimal control theory. The framework is developed for two model classes: a nonlinear model of discrete-valued processes, directly motivated by the transformer, and a linear Gaussian model as a tractable baseline. For both model classes, the prediction objective is reformulated as an optimal control problem whose solution yields an explicit inference algorithm, the dual filter, with a layer structure that mirrors that of a decoder-only transformer. Numerical experiments compare the optimal control with attention weights from a trained transformer. These…
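For readers less familiar with the transformer side of this comparison, the following is a minimal sketch of how a single causal attention head could turn past tokens into a next-token distribution. It is not the paper's dual filter. The names `next_token_distribution`, `EMBED`, and `UNEMBED` are hypothetical, and the sketch assumes identity query/key/value projections, tied embedding and unembedding weights, and a toy three-token vocabulary.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def next_token_distribution(tokens, embed, unembed):
    """Single-head attention read out at the last position.

    Assumption (illustration only): queries, keys, and values are the raw
    token embeddings, i.e. the projection matrices are the identity.
    """
    xs = [embed[t] for t in tokens]   # embeddings of past observations
    d = len(xs[0])
    query = xs[-1]                    # the last token attends to the past
    scores = [dot(query, x) / math.sqrt(d) for x in xs]
    weights = softmax(scores)         # attention weights over past tokens
    context = [sum(w * x[i] for w, x in zip(weights, xs)) for i in range(d)]
    logits = [dot(context, u) for u in unembed]
    return weights, softmax(logits)   # conditional next-token probabilities

# Toy setup (hypothetical): vocabulary of 3 tokens, embedding dimension 2.
EMBED = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
UNEMBED = EMBED  # tied weights, an assumption for brevity

weights, probs = next_token_distribution([0, 2, 1], EMBED, UNEMBED)
print("attention weights:", weights)
print("next-token probabilities:", probs)
```

The attention weights returned here are the kind of quantity that the abstract says is compared against the optimal control in the numerical experiments; how that comparison is set up is not specified in this summary.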
