Energy-Based Transformers are Scalable Learners and Thinkers

Alexi Gladstone; Ganesh Nanduru; Md Mofijul Islam; Peixuan Han; Hyeonjeong Ha; Aman Chadha; Yilun Du; Heng Ji; Jundong Li; Tariq Iqbal

arXiv:2507.02092·cs.LG·July 4, 2025

TL;DR

This paper introduces Energy-Based Transformers (EBTs), a novel class of models that learn to verify input-prediction compatibility through unsupervised energy minimization, enabling scalable and generalizable thinking across modalities.

Contribution

The paper proposes EBTs, a new energy-based model class that scales faster and generalizes better than traditional transformers by learning to verify predictions via unsupervised energy minimization.

Findings

01

EBTs achieve up to a 35% higher scaling rate than Transformer++ during training.

02

With System 2 Thinking, EBTs improve performance on language tasks by 29% more than Transformer++.

03

EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes.

Abstract

Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performances. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question "Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?" Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers…
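The abstract's idea of re-framing prediction as optimization with respect to a verifier can be illustrated with a toy sketch. This is not the authors' implementation: a hand-written quadratic energy stands in for a trained model, and the names `energy`, `energy_grad_y`, and `think`, along with the step size, tolerance, and step budget, are assumptions chosen only for illustration. The sketch starts from an arbitrary candidate prediction and refines it by gradient descent on the energy until the energy is low enough.

```python
# Toy illustration only: a hand-written quadratic energy stands in for a
# trained Energy-Based Transformer. All names and constants are hypothetical.

def energy(x, y, w=2.0):
    """Scalar energy: low when candidate y is compatible with input x."""
    return (y - w * x) ** 2


def energy_grad_y(x, y, w=2.0):
    """Analytic gradient of the energy with respect to the candidate y."""
    return 2.0 * (y - w * x)


def think(x, y_init=0.0, step_size=0.1, max_steps=100, tol=1e-6):
    """Refine a candidate prediction by gradient descent on the energy.

    Returns the refined prediction and the energy recorded after each step.
    """
    y = y_init
    energies = [energy(x, y)]
    for _ in range(max_steps):
        # Move the candidate in the direction that lowers the energy.
        y = y - step_size * energy_grad_y(x, y)
        energies.append(energy(x, y))
        # Stop early once the candidate is judged compatible enough.
        if energies[-1] < tol:
            break
    return y, energies


y_hat, trace = think(x=3.0)
print(f"prediction={y_hat:.4f}, steps={len(trace) - 1}, final energy={trace[-1]:.2e}")
```

In this sketch the number of refinement steps depends on how far the initial candidate is from a low-energy answer, which loosely mirrors the idea of spending more inference-time computation on harder predictions. In a real model the gradient would come from automatic differentiation through a learned network rather than from a closed-form expression.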

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01·Rating 6·Confidence 3

Strengths

- I think the core idea, integrating self-verification at inference time, is promising.
- The empirical results verify the effectiveness of the proposed model.

Weaknesses

- Although this paper is clear overall, I think the authors have spent too much space introducing related concepts rather than the technical details of the proposed method. It is unclear how the model specifically works. It would be good to have an extra section explaining the details of the model, perhaps through a toy example.
- It is unclear why "frame EBM as an optimization problem" can avoid the curse of dimensionality. All the evidence provided in the paper is just vague discussion

Reviewer 02·Rating 8·Confidence 3

Strengths

The core contribution is genuinely novel and conceptually elegant. The insight that verification is easier than generation (grounded in complexity theory) is well-motivated, and coupling the verifier and generator through energy gradients avoids adversarial dynamics. The cross-modal validation across language, video, and images demonstrates generality beyond domain-specific tricks. Most compellingly, Figure 6 showing that thinking gains increase with distributional shift mirrors human cognition

Weaknesses

The scale limitations severely undermine the paper's claims. All experiments max out at 800M parameters while making assertions about foundation model behavior through extrapolation. The paper relies on extrapolation to larger sizes for many claims, which is speculative given that scaling laws often break in different regimes. Without validation, the central claims about foundation model potential remain unsubstantiated. The computational costs (3.33-6.66× FLOPs for training, gradient computation over

Reviewer 03·Rating 8·Confidence 3

Strengths

- This paper is well written with clear motivation analysis, comprehensive experimental design, and informative figures that effectively support the main arguments.
- The core insight of building generation models based on the principle that verification is easier than generation is novel and insightful. This approach provides a fresh perspective on how to design generative models by leveraging verification capabilities.
- The dynamic computation allocation based on problem difficulty is an int

Weaknesses

- The verification approach in this work does not align with intuitive expectations. For language tasks, meaningful verification should operate on complete sentences or full responses to questions. However, EBT uses autoregressive generation where each token is verified independently during generation. This raises questions about the validity of verification at the token level rather than at the semantic level of complete thoughts or responses.
- This work lacks sufficient analysis of computati

Code & Models

Repositories

alexiglad/EBT
PyTorch·Official

Videos

Energy-Based Transformers are Scalable Learners and Thinkers (Paper Review)· youtube

Energy-Based Transformers explained | How EBTs and EBMs work· youtube

Taxonomy

Methods·Diffusion