Free Energy Mixer
Jiecheng Lu, Shihao Yang

TL;DR
The paper introduces the Free Energy Mixer (FEM), an attention read mechanism that adaptively balances averaging and channel-wise selection, improving performance across NLP, vision, and time-series tasks without increasing asymptotic complexity.
Contribution
FEM is a new attention read method that applies a value-driven, per-channel log-linear tilt, enabling smooth transition from averaging to selection while maintaining computational efficiency.
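As a hedged illustration of the read described above, one standard way to write a per-channel free-energy (log-sum-exp) read is shown below. The symbols $p_i$ (prior weight on index $i$), $v_{i,c}$ (channel $c$ of the value at index $i$), and $\beta_c > 0$ (inverse temperature for channel $c$) are notation assumed here, not taken from the paper, and the paper's two-level gated form may differ in detail.

$$
F_c(\beta_c) \;=\; \frac{1}{\beta_c}\,\log \sum_i p_i \, e^{\beta_c v_{i,c}}
\;=\; \max_{q}\Big\{ \sum_i q_i v_{i,c} \;-\; \frac{1}{\beta_c}\,\mathrm{KL}(q \,\|\, p) \Big\},
\qquad
q_i^\star \;\propto\; p_i \, e^{\beta_c v_{i,c}} .
$$

$$
\lim_{\beta_c \to 0^+} F_c(\beta_c) \;=\; \sum_i p_i v_{i,c}
\quad\text{(averaging)},
\qquad
\lim_{\beta_c \to \infty} F_c(\beta_c) \;=\; \max_{i:\,p_i>0} v_{i,c}
\quad\text{(selection)} .
$$

The second equality in the first display is the Gibbs (Donsker-Varadhan) variational identity: the maximizer $q^\star$ is the prior tilted log-linearly by the values, which is one way to read the phrase "value-driven, per-channel log-linear tilt".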
Findings
Outperforms strong baselines on NLP, vision, and time-series tasks.
Works with standard and linear attention models.
Maintains original asymptotic complexity.
Abstract
Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs, and SSMs. It consistently outperforms strong…
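The move from averaging to per-channel selection can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the read takes the form `(1/beta_c) * log(sum_i prior_i * exp(beta_c * v_ic))`, and the function names `softmax` and `free_energy_read` are hypothetical helpers written for this illustration. It omits the gating, the learned temperature, and any linear-attention variant mentioned in the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats (the 'fast prior')."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def free_energy_read(prior, values, beta):
    """Per-channel log-sum-exp read (assumed form, see lead-in).

    prior:  list of T non-negative weights summing to 1
    values: T x D nested list, values[i][c] is channel c at index i
    beta:   list of D non-negative inverse temperatures, one per channel
    """
    num_channels = len(values[0])
    out = []
    for c in range(num_channels):
        column = [row[c] for row in values]
        b = beta[c]
        if b == 0.0:
            # beta = 0 is the limiting case: the plain convex average.
            out.append(sum(p * v for p, v in zip(prior, column)))
            continue
        # Shift by the largest exponent so exp() cannot overflow.
        exponents = [b * v for v in column]
        m = max(exponents)
        s = sum(p * math.exp(e - m) for p, e in zip(prior, exponents))
        out.append((m + math.log(s)) / b)
    return out


if __name__ == "__main__":
    prior = softmax([0.0, 0.0, 0.0])          # uniform prior over 3 indices
    values = [[1.0, 0.0], [2.0, 5.0], [3.0, -1.0]]
    for b in (0.0, 1.0, 50.0):
        print(b, free_energy_read(prior, values, [b, b]))
```

With a small inverse temperature the read stays close to the prior-weighted mean of each channel; with a large one, each channel independently approaches its own maximum over the indices the prior supports, so different channels can select different positions.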
Peer Reviews
Decision: ICLR 2026 Poster
- To the best of my knowledge, the authors introduce a novel technique for mixing the value vectors in the attention block, enriching the expressivity of the operation. This approach can be applied to both attention-based models, as well as recurrent variations.
- The authors do extensive evaluation on reasonable model/training budgets (1.3B parameters/100B tokens, 340M parameters/15B tokens), covering language modelling, image modelling, and time series forecasting.
- The authors account for th
- I am not entirely convinced that the "lossy mixing" of the classical attention is obviously a significant limitation. While the theoretical justification seems compelling, it would be interesting to understand how different the proposed mixing ends up being from the standard attention after training. Nonetheless, this does not detract from the paper’s contribution, as the empirical results indicate consistent benchmark improvements.
- The proposed method appears to add some constant computatio
The paper makes a solid contribution to the field. **Originality**: The primary contribution is the principled reframing of the attention readout as a "variational free-energy optimization" (Section 2, Contribution 2). This provides a novel, value-aware alternative to the standard convex-combination readout. The "Linearized Temperature Learning" (Section 2.3.1) is also a clever technique for maintaining efficiency. The originality is slightly tempered by prior work on LSE-based mechanisms, suc
1. **Practical Efficiency and Latency**: The paper's primary weakness is the gap between theoretical and practical efficiency. While FEM preserves "asymptotic complexity" (Abstract), Table 5 shows a significant practical latency cost: "FEM-SM" (0.017s) is substantially slower than the baseline Transformer ("FEM-SM (-G,T,L,C)", 0.012s) in the forward pass. The authors' "Limitation" section (Section 4) admits this is due to a "lack of fused CUDA kernels". This is a critical barrier f
Motivation and Problem Identification: The paper compellingly argues that the standard per-head convex combination of values creates a "lossy read" bottleneck, preventing channel-wise selection. This is a well-articulated and non-trivial insight into a potential limitation of standard attention mechanisms.
Theoretical and Empirical Rigor: The proposed FEM method is grounded in a solid theoretical framework derived from the Donsker-Varadhan variational principle. The paper is supported by thorou
1. **Baseline Comparisons:** While the empirical results are comprehensive, the baselines could be strengthened by including more recent architectures that also introduce channel-wise inductive biases. A comparison against methods like Mamba (with its data-dependent SSM) or Multi-head Latent Attention (MLA), which inherently manipulate channel interactions, would provide a more rigorous assessment of FEM's unique contribution beyond simply adding channel-specific gating or modulation.
2. **Cor
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques
