Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

TL;DR
This paper analyzes how cross-entropy training shapes the internal geometry of transformer attention mechanisms, revealing a coupled routing and specialization process that implements Bayesian inference.
Contribution
It provides a first-order analytical framework explaining how gradient dynamics induce Bayesian manifold structures in transformer attention heads.
Findings
Attention scores follow an advantage-based routing law.
Values are updated via a responsibility-weighted rule resembling an EM algorithm.
Gradient dynamics sculpt low-dimensional Bayesian manifolds supporting probabilistic reasoning.
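The first two findings can be recovered with a short chain-rule calculation. The sketch below assumes a standard softmax attention head, with output o_i = \sum_j \alpha_{ij} v_j, weights \alpha_{ij} = \mathrm{softmax}_j(s_{ij}), and upstream gradient u_i := \partial L / \partial o_i. This setup is an assumption made here for illustration and may differ in detail from the paper's. Scores and values are treated as independent variables.

\[ \frac{\partial \alpha_{ik}}{\partial s_{ij}} = \alpha_{ik}\bigl(\delta_{kj} - \alpha_{ij}\bigr), \qquad \frac{\partial L}{\partial s_{ij}} = \sum_k u_i^\top v_k \,\alpha_{ik}\bigl(\delta_{kj} - \alpha_{ij}\bigr) = \alpha_{ij}\Bigl(u_i^\top v_j - \sum_k \alpha_{ik}\, u_i^\top v_k\Bigr), \]
\[ \frac{\partial L}{\partial v_j} = \sum_i \alpha_{ij}\, u_i \quad\Longrightarrow\quad \Delta v_j = -\eta \sum_i \alpha_{ij}\, u_i . \]

Under these assumptions the score gradient is the attention weight times the gap between one key's score u_i^\top v_j and its attention-weighted average, which is the sense in which the routing is advantage-based. The value update weights each query's upstream gradient by \alpha_{ij}, which plays the role a responsibility plays in an EM M-step.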
Abstract
Transformers empirically perform precise probabilistic reasoning in carefully constructed ``Bayesian wind tunnels'' and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an \emph{advantage-based routing law} for attention scores, \[ \frac{\partial L}{\partial s_{ij}} = \alpha_{ij}\bigl(b_{ij}-\mathbb{E}_{\alpha_i}[b]\bigr), \qquad b_{ij} := u_i^\top v_j, \] coupled with a \emph{responsibility-weighted update} for values, \[ \Delta v_j = -\eta\sum_i \alpha_{ij} u_i, \] where $u_i$ is the upstream gradient at position $i$ and $\alpha_{ij}$ are attention weights. These equations induce a positive feedback loop in which routing and…
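The two gradient formulas in the abstract can be checked numerically on a toy attention head. The following is a minimal sketch, not the authors' code: it assumes a single unmasked softmax head whose scores s and values v are free parameters, and a hypothetical unembedding matrix W that maps head outputs to class logits for a cross-entropy loss. The names W, y, loss_fn, analytic_grads, and numeric_grad are illustrative.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax with max-subtraction for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def loss_fn(s, v, W, y):
    # Cross-entropy of a toy head: alpha = softmax(s), o = alpha @ v, logits = o @ W.
    alpha = softmax(s)                      # (n, n) attention weights
    o = alpha @ v                           # (n, d) head outputs
    p = softmax(o @ W)                      # (n, C) class probabilities
    return -np.log(p[np.arange(len(y)), y]).sum()

def analytic_grads(s, v, W, y):
    # Gradients in the form stated in the abstract.
    alpha = softmax(s)
    p = softmax((alpha @ v) @ W)
    p[np.arange(len(y)), y] -= 1.0          # p - onehot(y)
    u = p @ W.T                             # (n, d) upstream gradient dL/do_i
    b = u @ v.T                             # b_ij = u_i^T v_j
    # Advantage-based routing law: alpha_ij * (b_ij - E_{alpha_i}[b]).
    grad_s = alpha * (b - (alpha * b).sum(axis=1, keepdims=True))
    # Responsibility-weighted value gradient: sum_i alpha_ij u_i.
    grad_v = alpha.T @ u
    return grad_s, grad_v

def numeric_grad(f, x, eps=1e-6):
    # Central finite differences, one entry at a time.
    g = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        xp = x.copy(); xp[idx] += eps
        xm = x.copy(); xm[idx] -= eps
        g[idx] = (f(xp) - f(xm)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n, d, C = 4, 3, 5                           # positions, value dim, classes (arbitrary)
s = rng.normal(size=(n, n))
v = rng.normal(size=(n, d))
W = rng.normal(size=(d, C))
y = rng.integers(0, C, size=n)

grad_s, grad_v = analytic_grads(s, v, W, y)
num_s = numeric_grad(lambda s_: loss_fn(s_, v, W, y), s)
num_v = numeric_grad(lambda v_: loss_fn(s, v_, W, y), v)
err_s = np.abs(grad_s - num_s).max()
err_v = np.abs(grad_v - num_v).max()
print("max |analytic - numeric| for scores:", err_s)
print("max |analytic - numeric| for values:", err_v)
```

Both discrepancies should be at the level of finite-difference error. Each row of grad_s also sums to zero, since the advantage is measured against its own attention-weighted mean. A gradient-descent step on the values is then v - eta * grad_v, matching the sign of the update in the abstract.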
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
