TL;DR
This paper introduces Energy-Based Transformers (EBTs), a novel class of models that learn to verify input-prediction compatibility through unsupervised energy minimization, enabling scalable and generalizable thinking across modalities.
Contribution
The paper proposes EBTs, a new energy-based model class that scales faster and generalizes better than traditional transformers by learning to verify predictions via unsupervised energy minimization.
Findings
EBTs achieve up to a 35% higher scaling rate than Transformer++ during training.
With System 2 Thinking, EBTs improve language task performance by 29% more than Transformer++.
EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes.
Abstract
Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performance. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question "Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?" Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers…
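The idea of "prediction as optimization with respect to a verifier" can be illustrated with a toy sketch. This is not the paper's model: the quadratic `energy` function below is a hand-written stand-in for a learned energy function, and the names `energy`, `energy_grad_y`, `think`, `step_size`, and `num_steps` are hypothetical, chosen only for illustration. The sketch assumes a scalar input and prediction and uses plain gradient descent on the candidate prediction.

```python
def energy(x, y):
    """Toy stand-in for a learned verifier (assumption, not the paper's model).

    Low energy means the candidate prediction y is compatible with input x.
    Here the compatible prediction is simply y = 2 * x.
    """
    return (y - 2.0 * x) ** 2


def energy_grad_y(x, y):
    """Analytic gradient of the toy energy with respect to the candidate y."""
    return 2.0 * (y - 2.0 * x)


def think(x, y_init=0.0, step_size=0.1, num_steps=50):
    """Refine a candidate prediction by gradient descent on the energy.

    Returns the final candidate and the energy recorded at every step,
    including the initial guess.
    """
    y = y_init
    energies = [energy(x, y)]
    for _ in range(num_steps):
        # Move the candidate in the direction that lowers the energy.
        y = y - step_size * energy_grad_y(x, y)
        energies.append(energy(x, y))
    return y, energies


prediction, energies = think(x=3.0)
print(f"refined prediction: {prediction:.4f}")
print(f"energy before: {energies[0]:.4f}, after: {energies[-1]:.8f}")
```

In this toy setting, raising `num_steps` spends more computation on a single prediction and drives the energy lower, which loosely mirrors the notion of thinking longer at inference time. A learned energy function would replace the analytic gradient with automatic differentiation through the model.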
Peer Reviews
Decision·ICLR 2026 Oral
- I think the core idea, integrating self-verification in inference time, is promising.
- The empirical results verify the effectiveness of the proposed model.
- Although this paper is clear overall, I think the authors have spent too much space introducing related concepts rather than the technical details of the proposed method. It is unclear how the model specifically works. It would be good to have an extra section explaining the details of the model, perhaps through a toy example.
- It is unclear why "frame EBM as an optimization problem" can avoid the curse of dimensionality. All the evidence provided in the paper is just vague discussion
The core contribution is genuinely novel and conceptually elegant. The insight that verification is easier than generation (grounded in complexity theory) is well-motivated, and coupling the verifier and generator through energy gradients avoids adversarial dynamics. The cross-modal validation across language, video, and images demonstrates generality beyond domain-specific tricks. Most compellingly, Figure 6 showing that thinking gains increase with distributional shift mirrors human cognition
The scale limitations severely undermine the paper's claims. All experiments max out at 800M parameters while making assertions about foundation model behavior through extrapolation. The paper uses extrapolation to larger sizes for many claims, which is speculative given that scaling laws often break at different regimes. Without validation, the central claims about foundation model potential remain unsubstantiated. The computational costs (3.33-6.66× FLOPs for training, gradient computation over
- This paper is well written with clear motivation analysis, comprehensive experimental design, and informative figures that effectively support the main arguments.
- The core insight of building generation models based on the principle that verification is easier than generation is novel and insightful. This approach provides a fresh perspective on how to design generative models by leveraging verification capabilities.
- The dynamic computation allocation based on problem difficulty is an int
- The verification approach in this work does not align with intuitive expectations. For language tasks, meaningful verification should operate on complete sentences or full responses to questions. However, EBT uses autoregressive generation where each token is verified independently during generation. This raises questions about the validity of verification at the token level rather than at the semantic level of complete thoughts or responses.
- This work lacks sufficient analysis of computati
Videos
Energy-Based Transformers are Scalable Learners and Thinkers (Paper Review)· youtube
Energy-Based Transformers explained | How EBTs and EBMs work· youtube
Taxonomy
Methods·Diffusion
