Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

Naman Agarwal; Siddhartha R. Dalal; Vishal Misra

arXiv:2512.22473·stat.ML·May 19, 2026

TL;DR

This paper analyzes how cross-entropy training shapes the internal geometry of transformer attention mechanisms, revealing a coupled routing and specialization process that implements Bayesian inference.

Contribution

The paper gives a first-order analysis of how gradient dynamics induce Bayesian manifold structure in transformer attention heads.

Findings

01

Attention scores follow an advantage-based routing law.

02

Values are updated via a responsibility-weighted rule resembling an EM algorithm.

03

Gradient dynamics sculpt low-dimensional Bayesian manifolds supporting probabilistic reasoning.
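The first two findings can be traced to a short chain-rule calculation. The block below is a sketch under assumed notation, not the authors' derivation: it takes a single head with outputs $o_i = \sum_j \alpha_{ij} v_j$, attention weights $\alpha_{ij} = \mathrm{softmax}_j(s_{ij})$, and upstream gradient $u_i := \partial L / \partial o_i$, and it treats the scores $s_{ij}$ and values $v_j$ as free variables.

```latex
% Assumed setup: o_i = \sum_j \alpha_{ij} v_j, \alpha_{ij} = softmax_j(s_{ij}),
% u_i := \partial L / \partial o_i. Scores and values are treated as free variables.
\begin{align*}
\frac{\partial L}{\partial \alpha_{ij}}
  &= u_i^\top v_j =: b_{ij}, \\
\frac{\partial \alpha_{ik}}{\partial s_{ij}}
  &= \alpha_{ik}\,(\delta_{kj} - \alpha_{ij}), \\
\frac{\partial L}{\partial s_{ij}}
  &= \sum_k b_{ik}\,\alpha_{ik}\,(\delta_{kj} - \alpha_{ij})
   = \alpha_{ij}\Bigl(b_{ij} - \sum_k \alpha_{ik}\, b_{ik}\Bigr), \\
\frac{\partial L}{\partial v_j}
  &= \sum_i \alpha_{ij}\, u_i .
\end{align*}
```

The third line is the score gradient in "advantage" form: the term in parentheses compares $b_{ij}$ with its attention-weighted mean over keys. The last line, multiplied by $-\eta$, is a gradient-descent step on $v_j$ in which each position $i$ contributes in proportion to its weight $\alpha_{ij}$.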

Abstract

Transformers empirically perform precise probabilistic reasoning in carefully constructed "Bayesian wind tunnels" and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an advantage-based routing law for attention scores,

\[ \frac{\partial L}{\partial s_{ij}} = \alpha_{ij}\bigl(b_{ij}-\mathbb{E}_{\alpha_i}[b]\bigr), \qquad b_{ij} := u_i^\top v_j, \]

coupled with a responsibility-weighted update for values,

\[ \Delta v_j = -\eta\sum_i \alpha_{ij} u_i, \]

where $u_i$ is the upstream gradient at position $i$ and $\alpha_{ij}$ are attention weights. These equations induce a positive feedback loop in which routing and…
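The two gradient formulas in the abstract can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a toy single head with no mask whose outputs $o_i = \sum_j \alpha_{ij} v_j$ are fed directly to a cross-entropy loss as logits (so $u_i = \mathrm{softmax}(o_i) - \mathrm{onehot}(y_i)$), and it treats scores and values as free parameters. All function names and sizes are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def forward(scores, values, labels):
    """Toy head (assumption): outputs o_i are used directly as logits."""
    alpha = [softmax(row) for row in scores]
    n_k, dim = len(values), len(values[0])
    loss, u = 0.0, []
    for i, a in enumerate(alpha):
        o = [sum(a[j] * values[j][d] for j in range(n_k)) for d in range(dim)]
        p = softmax(o)
        loss -= math.log(p[labels[i]])
        # Upstream gradient u_i = dL/do_i = softmax(o_i) - onehot(y_i).
        u.append([p[d] - (1.0 if d == labels[i] else 0.0) for d in range(dim)])
    return loss, alpha, u

def analytic_grads(scores, values, labels):
    """Gradients from the abstract's formulas: dL/ds_ij and dL/dv_j."""
    _, alpha, u = forward(scores, values, labels)
    n_q, n_k, dim = len(scores), len(values), len(values[0])
    # b_ij = u_i^T v_j
    b = [[sum(u[i][d] * values[j][d] for d in range(dim)) for j in range(n_k)]
         for i in range(n_q)]
    grad_s = []
    for i in range(n_q):
        mean_b = sum(alpha[i][j] * b[i][j] for j in range(n_k))  # E_{alpha_i}[b]
        grad_s.append([alpha[i][j] * (b[i][j] - mean_b) for j in range(n_k)])
    # dL/dv_j = sum_i alpha_ij u_i, so Delta v_j = -eta * grad_v[j].
    grad_v = [[sum(alpha[i][j] * u[i][d] for i in range(n_q)) for d in range(dim)]
              for j in range(n_k)]
    return grad_s, grad_v

def numeric_grad(scores, values, labels, which, r, c, eps=1e-6):
    """Central finite difference of the loss in one entry of scores or values."""
    def loss_with(delta):
        s = [row[:] for row in scores]
        v = [row[:] for row in values]
        (s if which == "s" else v)[r][c] += delta
        return forward(s, v, labels)[0]
    return (loss_with(eps) - loss_with(-eps)) / (2 * eps)

random.seed(0)
n_q, n_k, dim = 3, 4, 5  # hypothetical sizes: queries, keys, value dimension
scores = [[random.gauss(0, 1) for _ in range(n_k)] for _ in range(n_q)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_k)]
labels = [random.randrange(dim) for _ in range(n_q)]

grad_s, grad_v = analytic_grads(scores, values, labels)
max_err_s = max(abs(grad_s[i][j] - numeric_grad(scores, values, labels, "s", i, j))
                for i in range(n_q) for j in range(n_k))
max_err_v = max(abs(grad_v[j][d] - numeric_grad(scores, values, labels, "v", j, d))
                for j in range(n_k) for d in range(dim))
print("max |analytic - numeric| for scores:", max_err_s)
print("max |analytic - numeric| for values:", max_err_v)
```

Both printed errors should sit at finite-difference noise level. Each row of `grad_s` also sums to zero, since the advantages $b_{ij} - \mathbb{E}_{\alpha_i}[b]$ average to zero under $\alpha_i$: a gradient step moves attention mass between keys without changing its total.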

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks