Transformers as Intrinsic Optimizers: Forward Inference through the Energy Principle
Ruifeng Ren, Sheng Ouyang, Huayi Tang, Yong Liu

TL;DR
This paper introduces an energy-based framework to understand and modify attention mechanisms in Transformers, connecting classical optimization algorithms with new attention structures, and providing insights into their underlying principles.
Contribution
It presents a unified energy-based perspective on attention in Transformers and proposes novel attention mechanisms inspired by classical optimization algorithms.
Findings
Different attention forms can be derived from the energy framework.
Energy-based modifications lead to new attention structures.
Preliminary experiments support the framework's potential.
Abstract
Attention-based Transformers have demonstrated strong adaptability across a wide range of tasks and have become the backbone of modern Large Language Models (LLMs). However, their underlying mechanisms remain open for further exploration. The energy-based perspective has long provided a valuable principle for understanding neural computation. In this paper, we revisit the principle of energy as a lens to understand attention-based Transformer models. We present a unified energy-based framework which is composed of three key components: the local energy, the global energy, and the employed optimization algorithms. We show that different attention forms including unnormalized linear attention, gated linear attention and standard softmax attention can be induced by choosing their corresponding recipes within this framework. Building on this framework, we propose energy-based…
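The claim that softmax attention can be induced by an energy and an optimization algorithm can be illustrated with a minimal sketch. It assumes a generic log-sum-exp (Helmholtz-type) free energy, F(z) = -T log Σ_i exp(⟨z, h_i⟩ / T), and a single plain gradient descent step. The energy form, the function names, and the omission of projection matrices are illustrative assumptions, not the paper's exact definitions. Under these assumptions, the gradient of F is the negative softmax-weighted average of the tokens h_i, so one descent step adds a softmax-attention readout to z.

```python
import math


def scaled_scores(z, keys, temperature):
    # Dot-product score <z, h_i> / T for every token h_i.
    return [sum(a * b for a, b in zip(z, h)) / temperature for h in keys]


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def free_energy(z, keys, temperature):
    # Assumed energy: F(z) = -T * log sum_i exp(<z, h_i> / T).
    scores = scaled_scores(z, keys, temperature)
    m = max(scores)
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scores)))


def energy_gradient(z, keys, temperature):
    # Analytic gradient: dF/dz = -sum_i softmax_i(<z, h_i> / T) * h_i.
    weights = softmax(scaled_scores(z, keys, temperature))
    return [-sum(w * h[d] for w, h in zip(weights, keys)) for d in range(len(z))]


def gradient_step(z, keys, temperature, step_size):
    # One gradient descent step: z <- z - step_size * dF/dz,
    # i.e. z plus step_size times a softmax-attention readout over the tokens.
    grad = energy_gradient(z, keys, temperature)
    return [z_d - step_size * g_d for z_d, g_d in zip(z, grad)]


if __name__ == "__main__":
    # Hypothetical toy tokens and query, chosen only for illustration.
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    z = [0.5, -0.2]
    z_new = gradient_step(z, keys, temperature=0.7, step_size=0.5)
    print("energy before:", free_energy(z, keys, 0.7))
    print("energy after: ", free_energy(z_new, keys, 0.7))
```

In this sketch the tokens act as both keys and values. Recovering the usual query, key, and value projections requires additional conditions tying the step size, temperature, and weight matrices together, which is the kind of assumption the reviewers question.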
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The proposed framework is well presented with a clear derivation.
- The proposed energy minimization perspective is integrated with the backward process through an alternating optimization process, which may help in understanding the training process of Transformers from the energy minimization perspective.
- There have been many works that use the energy minimization perspective to understand the forward process of Transformers (e.g. [1]). The authors mentioned that "Although these studies establish certain connections between energy and Transformers, the design of energy functions is often not straightforward and lacks a unified framework to understand...". However, I don't see how this work is more straightforward or unified than the existing ones.
- In the derivation of the energy function, th
The paper formulates **attention as an energy-minimization process**. This approach is **innovative, and some of the derivations are elegant**. The paper also **shows that dot-product attention is a special case of their proposed energy-based formulation**. It's always nice to see new perspectives on such an important mechanism of today. The paper also **attempts to empirically validate their energy-minimization approach**, with at least preliminary support.
Unfortunately, as detailed in the summary: **The theoretical/design formulation has key gaps**: *stability issues* with the proposed optimization, *no stability analysis for the temperature parameter*, and *lacking analysis for soundness of approximations* used. **Limited applicability of the link with dot-product attention**: Dot-product attention is shown as a special case of the proposed energy-based formulation *only under very rigid constraints*. Although it's mentioned these could be
The paper presents a clear and unified theoretical framework connecting Transformer attention mechanisms with energy minimization principles. The Helmholtz free energy perspective provides an interpretable physical analogy for attention updates. The derivation of both first- and second-order variants is mathematically sound and highlights the potential for curvature-aware improvements. The multi-head extension is conceptually consistent and technically elegant. The presentation is clear, with go
1. **Limited experimental scope** The paper validates the proposed framework only on a synthetic task (Longest Increasing Subsequence, LIS). It lacks evaluations on realistic NLP or vision benchmarks such as language modeling or image classification, leaving the practical effectiveness and generalization ability unclear.
2. **Small performance gains** The improvements reported on the LIS task are modest, without statistical significance analysis or comparisons against stronger baselines (e.g.
**Novel unified optimization perspective** This paper proposes a novel unified framework to interpret the forward inference and backward inference as jointly minimizing the free energy using alternating gradient descent, bridging the gap between energy-based modeling and deep network dynamics.
**Theory-inspired algorithm** The motivation behind the proposed architectures is clear since it arises naturally from the theoretical derivation in Section 4. Moreover, the idea of preconditioning token
**Preliminary empirical validation** The experiments focus only on the *Longest Increasing Subsequence* task, which is a bit weak. It would be better to see how the proposed Transformers perform on more realistic language tasks.
**Strong assumptions in Theorem 1** The interpretation and Theorem 1 rely on the assumption that $\|z\| = \|Wh\|$ and $\eta T W_Q^\top W_K = W_V$. This assumption seems to hold only with layer normalization. Can the authors provide some other examples where this assumptio
Taxonomy
Topics: Topic Modeling · Machine Learning in Materials Science · Multimodal Machine Learning Applications
