Revisiting Transformer Layer Parameterization Through Causal Energy Minimization
Jin Xu, Camille Couturier, Victor Rühle, Saravan Rajmohan, James Hensman

TL;DR
This paper introduces Causal Energy Minimization (CEM), a novel framework that interprets Transformer layers as energy-based optimization steps, enabling new layer design insights and stable training at moderate scales.
Contribution
CEM provides a new perspective on Transformer layer parameterization by framing each layer as an energy minimization step, connecting Transformer architectures to energy-based models and exploring new design spaces.
Findings
CEM-derived layers train stably at moderate scales.
They can match baseline Transformer performance despite constrained parameterizations.
CEM offers a new lens for understanding and designing Transformer architectures.
Abstract
Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical. We introduce Causal Energy Minimization (CEM), a framework that recasts Transformer layers as optimization steps on conditional energy functions while explicitly accounting for layer parameterization. Extending prior energy-based interpretations of attention, CEM shows that weight-tied MHA can be derived as a gradient update on an interaction energy, and that a gated MLP with shared up/down projections can be viewed through an element-wise energy. This perspective identifies a design space for Transformer layers that includes within-layer weight sharing, diagonal-plus-low-rank interactions, lightweight preconditioners, and recursive updates. We evaluate CEM-derived layers in…
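The two derivations mentioned in the abstract can be illustrated with a small numerical sketch. This is not the paper's formulation: the energies below (a log-sum-exp interaction energy and a softplus element-wise energy), the single-head setup, and all names (`attention_energy`, `tied_attention_step`, `mlp_energy`, `tied_mlp_step`, `eta`) are assumptions chosen for illustration. The sketch only shows that one gradient step on such energies gives an attention-like update whose values are tied to its keys, and an MLP-like update whose down projection is the transpose of its up projection.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_energy(x, K, W):
    # Assumed interaction energy: E(x) = -logsumexp_j (k_j^T W x),
    # where the rows of K are earlier tokens (the causal context).
    scores = K @ (W @ x)
    m = scores.max()
    return -(m + np.log(np.sum(np.exp(scores - m))))

def tied_attention_step(x, K, W, eta):
    # Gradient step x - eta * dE/dx = x + eta * W^T K^T softmax(K W x).
    # The "values" W^T k_j reuse the matrix W that scores the keys.
    scores = K @ (W @ x)
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return x + eta * (W.T @ (K.T @ p))

def mlp_energy(x, W):
    # Assumed element-wise energy: E(x) = -sum_i softplus((W x)_i).
    return -np.sum(np.logaddexp(0.0, W @ x))

def tied_mlp_step(x, W, eta):
    # Gradient step x - eta * dE/dx = x + eta * W^T sigmoid(W x).
    # The down projection is W^T, i.e. shared with the up projection W.
    return x + eta * (W.T @ sigmoid(W @ x))

def numerical_grad(f, x, h=1e-6):
    # Central finite differences, used only to check the closed forms.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

rng = np.random.default_rng(0)
d, n, hidden, eta = 4, 5, 8, 0.1
x = rng.normal(size=d)             # current token representation
K = rng.normal(size=(n, d))        # earlier tokens
W_att = rng.normal(size=(d, d))    # shared attention matrix
W_mlp = rng.normal(size=(hidden, d))  # shared up/down projection

att_gap = (tied_attention_step(x, K, W_att, eta) - x) / eta \
    + numerical_grad(lambda v: attention_energy(v, K, W_att), x)
mlp_gap = (tied_mlp_step(x, W_mlp, eta) - x) / eta \
    + numerical_grad(lambda v: mlp_energy(v, W_mlp), x)
print("max attention gap:", np.abs(att_gap).max())
print("max MLP gap:", np.abs(mlp_gap).max())
```

Both printed gaps should be close to zero, since each update equals a step along the negative energy gradient. Multiple heads, gating, preconditioners, and recursive updates from the abstract are omitted here.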
