Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers
Lorenzo Tiberi, Francesca Mignacco, Kazuki Irie, Haim Sompolinsky

TL;DR
This paper develops a statistical mechanics theory for a tractable deep attention model, revealing how attention paths and kernel combinations influence generalization, with experimental validation and implications for network pruning.
Contribution
It introduces an analytically solvable model linking attention path interactions to kernel-based learning and generalization in Transformers.
Findings
Predictor statistics as a sum of attention path kernels
Kernel combination aligns with task labels, improving generalization
Pruning less relevant attention heads based on theory
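The first two findings can be illustrated with a toy sketch. It is not the paper's derivation: the weights here are hand-picked rather than obtained from the theory, the kernels are made-up 4x4 matrices, and kernel-target alignment is a standard stand-in measure that the paper may not use. The names `combine_kernels`, `alignment`, `task_kernel`, and `noise_kernel` are hypothetical.

```python
from itertools import product
from math import sqrt

def combine_kernels(path_kernels, weights):
    """Weighted sum over path pairs: K = sum over (a, b) of weights[a, b] * path_kernels[a, b]."""
    size = len(next(iter(path_kernels.values())))
    total = [[0.0] * size for _ in range(size)]
    for pair, kernel in path_kernels.items():
        w = weights[pair]
        for i, j in product(range(size), repeat=2):
            total[i][j] += w * kernel[i][j]
    return total

def alignment(kernel, labels):
    """Kernel-target alignment: y^T K y / (||K||_F * ||y||^2)."""
    n = len(labels)
    num = sum(labels[i] * kernel[i][j] * labels[j] for i in range(n) for j in range(n))
    frob = sqrt(sum(v * v for row in kernel for v in row))
    return num / (frob * sum(y * y for y in labels))

# Hypothetical toy data: 2 attention paths, 4 training examples.
labels = [1.0, 1.0, -1.0, -1.0]
# A kernel built from the labels themselves, so it is perfectly aligned with the task.
task_kernel = [[yi * yj for yj in labels] for yi in labels]
# An identity kernel that carries no information about the labels.
noise_kernel = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
path_kernels = {(0, 0): task_kernel, (1, 1): noise_kernel, (0, 1): noise_kernel}

# Assumed weightings: equal weights versus weights that favor the task-aligned path pair.
uniform = {pair: 1.0 for pair in path_kernels}
task_weighted = {(0, 0): 1.0, (1, 1): 0.1, (0, 1): 0.1}

a_uniform = alignment(combine_kernels(path_kernels, uniform), labels)
a_weighted = alignment(combine_kernels(path_kernels, task_weighted), labels)
print(a_uniform, a_weighted)
```

In this toy setting, down-weighting the task-agnostic path pairs raises the alignment of the total kernel with the labels, which is the qualitative effect the summary attributes to the kernel combination mechanism.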
Abstract
Despite the remarkable empirical performance of Transformers, their theoretical understanding remains elusive. Here, we consider a deep multi-head self-attention network that is closely related to Transformers yet analytically tractable. We develop a statistical mechanics theory of Bayesian learning in this model, deriving exact equations for the network's predictor statistics under the finite-width thermodynamic limit, i.e., $N, P \to \infty$, $P/N = \mathcal{O}(1)$, where $N$ is the network width and $P$ is the number of training examples. Our theory shows that the predictor statistics are expressed as a sum of independent kernels, each one pairing different 'attention paths', defined as information pathways through different attention heads across layers. The kernels are weighted according to a 'task-relevant kernel combination' mechanism that aligns the total kernel with the…
Taxonomy
Topics: Neural Networks and Applications
Methods: Pruning
