Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers

Lorenzo Tiberi; Francesca Mignacco; Kazuki Irie; Haim Sompolinsky

arXiv:2405.15926 · cs.LG · December 10, 2024 · 3 cites

TL;DR

This paper develops a statistical mechanics theory for a tractable deep attention model, revealing how attention paths and kernel combinations influence generalization, with experimental validation and implications for network pruning.

Contribution

It introduces an analytically solvable model linking attention path interactions to kernel-based learning and generalization in Transformers.

Findings

01 Predictor statistics as a sum of attention path kernels

02 Kernel combination aligns with task labels, improving generalization

03 Pruning less relevant attention heads based on theory

Abstract

Despite the remarkable empirical performance of Transformers, their theoretical understanding remains elusive. Here, we consider a deep multi-head self-attention network that is closely related to Transformers yet analytically tractable. We develop a statistical mechanics theory of Bayesian learning in this model, deriving exact equations for the network's predictor statistics under the finite-width thermodynamic limit, i.e., $N, P \to \infty$, $P / N = O(1)$, where $N$ is the network width and $P$ is the number of training examples. Our theory shows that the predictor statistics are expressed as a sum of independent kernels, each one pairing different 'attention paths', defined as information pathways through different attention heads across layers. The kernels are weighted according to a 'task-relevant kernel combination' mechanism that aligns the total kernel with the…
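The kernel structure described in the abstract can be illustrated with a minimal NumPy sketch. It enumerates attention paths (one head per layer), computes a feature vector per path, and combines the path-path kernels with a weight matrix. Everything here is an assumption made for illustration rather than the paper's implementation: the function names, the fixed random query-key matrices `qk`, the last-token readout, and the weight matrix `U`, which is a random positive semi-definite placeholder and not a solution of the theory's equations.

```python
import itertools
import numpy as np

def attention_paths(num_layers, num_heads):
    """Enumerate every attention path: one head index per layer."""
    return list(itertools.product(range(num_heads), repeat=num_layers))

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def path_feature(x, qk, path):
    """Propagate tokens x (T, d) through the heads on `path`.

    qk has shape (layers, heads, d, d) and holds fixed query-key
    products (hypothetical; assumed here to act on the raw input x).
    Returns the last token after the final layer, shape (d,).
    """
    h = x
    for layer, head in enumerate(path):
        scores = x @ qk[layer, head] @ x.T / np.sqrt(x.shape[1])
        h = softmax(scores) @ h
    return h[-1]

def combined_kernel(features, U):
    """Weighted sum of path-path kernels.

    features: (n_paths, P, d); U: (n_paths, n_paths).
    K[m, n] = sum_{a, b} U[a, b] * <f_a(x_m), f_b(x_n)> / d
    """
    d = features.shape[-1]
    C = np.einsum("amd,bnd->abmn", features, features) / d
    return np.einsum("ab,abmn->mn", U, C)

# Toy setup: 2 layers, 2 heads, 4 tokens, 8 input dimensions, 5 examples.
rng = np.random.default_rng(0)
n_layers, n_heads, T, d, P = 2, 2, 4, 8, 5
X = rng.normal(size=(P, T, d))
qk = rng.normal(size=(n_layers, n_heads, d, d)) / np.sqrt(d)

paths = attention_paths(n_layers, n_heads)
F = np.array([[path_feature(x, qk, p) for x in X] for p in paths])

# Placeholder weights: a random symmetric PSD matrix over path pairs.
A = rng.normal(size=(len(paths), len(paths)))
U = A @ A.T / len(paths)
K = combined_kernel(F, U)

# Identity weights keep only same-path pairs, i.e. a plain sum of
# per-path Gram matrices with no cross-path terms.
K_same_path = combined_kernel(F, np.eye(len(paths)))
```

With identity weights only same-path terms survive, whereas a non-diagonal `U` also mixes pairs of different paths; a symmetric positive semi-definite `U` keeps the combined kernel symmetric and positive semi-definite.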

Code & Models

Repositories

tiberilor/attention-paths-interplay (JAX, official)

Videos

Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers · SlidesLive

Taxonomy

Topics: Neural Networks and Applications

Methods: Pruning