Revisiting Transformer Layer Parameterization Through Causal Energy Minimization

Jin Xu; Camille Couturier; Victor Rühle; Saravan Rajmohan; James Hensman

arXiv:2605.07588·cs.LG·May 11, 2026

TL;DR

This paper introduces Causal Energy Minimization (CEM), a framework that interprets Transformer layers as optimization steps on conditional energy functions. The view exposes a design space for layer parameterization, and CEM-derived layers train stably at moderate scales.

Contribution

CEM offers a new perspective on Transformer layer parameterization by framing each layer as an energy minimization step, which connects Transformer architectures to energy-based models and opens up new design spaces.

Findings

01

CEM-derived layers train stably at moderate scales.

02

CEM-derived layers can match baseline Transformer performance despite their constrained parameterizations.

03

CEM offers a new lens for understanding and designing Transformer architectures.

Abstract

Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical. We introduce Causal Energy Minimization (CEM), a framework that recasts Transformer layers as optimization steps on conditional energy functions while explicitly accounting for layer parameterization. Extending prior energy-based interpretations of attention, CEM shows that weight-tied MHA can be derived as a gradient update on an interaction energy, and that a gated MLP with shared up/down projections can be viewed through an element-wise energy. This perspective identifies a design space for Transformer layers that includes within-layer weight sharing, diagonal-plus-low-rank interactions, lightweight preconditioners, and recursive updates. We evaluate CEM-derived layers in…
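
To make the "attention as a gradient update" idea concrete, here is a minimal single-head sketch in plain Python. It is not the paper's formulation: the energy E(x_t) = -log sum_s exp(x_t^T W x_s), the restriction to earlier tokens s < t held fixed, the single interaction matrix `W`, and the names `interaction_energy` and `attention_step` are all assumptions chosen for illustration, following the log-sum-exp energies used in earlier energy-based readings of attention. Under those assumptions, one gradient-descent step on the energy is a softmax-weighted sum of `W x_s`, so the same matrix serves the score and the value path, which is one way to read "weight-tied".

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, v):
    return [dot(row, v) for row in W]

def interaction_energy(x_t, context, W):
    """Assumed energy: E(x_t) = -log sum_s exp(x_t^T W x_s).

    `context` holds earlier tokens (s < t), treated as fixed, so the
    energy is conditional on the past (the causal part of the sketch).
    """
    scores = [dot(x_t, matvec(W, x_s)) for x_s in context]
    m = max(scores)  # subtract the max for a numerically stable log-sum-exp
    return -(m + math.log(sum(math.exp(s - m) for s in scores)))

def attention_step(x_t, context, W, eta=1.0):
    """One gradient-descent step: x_t <- x_t - eta * dE/dx_t."""
    keys = [matvec(W, x_s) for x_s in context]  # W x_s is both key and value
    scores = [dot(x_t, k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]  # softmax attention weights
    # -dE/dx_t = sum_s softmax_s * (W x_s)
    update = [sum(p * k[i] for p, k in zip(probs, keys))
              for i in range(len(x_t))]
    return [x + eta * u for x, u in zip(x_t, update)]

# Hypothetical toy inputs: three past tokens in two dimensions.
x_t = [0.5, -1.0]
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, 0.1], [-0.2, 0.3]]
x_new = attention_step(x_t, context, W, eta=0.1)
```

Because this assumed energy is concave in `x_t`, any gradient step with a nonzero gradient lowers it, which can be checked by comparing `interaction_energy` before and after the step. A multi-head version, and the element-wise energy the abstract associates with the gated MLP, are left out because the truncated abstract does not specify them.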
