Multi-Head Attention Is a Multi-Player Game
Kushal Chakrabarti, Nirmal Balachundar

TL;DR
This paper models transformer attention heads as players in a game, analyzes how their interactions affect training efficiency, and proposes regularization methods to improve performance and reduce hallucinations.
Contribution
It formalizes multi-head attention as a potential game, derives bounds on the resulting inefficiency, and introduces GAME-LoRA, which improves training outcomes by reducing head redundancy.
Findings
Price of Anarchy (PoA) bound predicts hallucination likelihood
Emergent coalitions show selective coordination
GAME-LoRA reduces hallucinations by up to 18%
Abstract
Modern transformer attention is internally multi-agent (heads compete and coordinate), yet we train it as if it were a monolithic optimizer. We formalize this gap: cross-entropy training induces an implicit potential game among heads, and gradient descent converges to Nash equilibria with potentially unbounded inefficiency due to unpriced externalities (redundancy, correlated errors). Our main result bounds the Price of Anarchy (PoA) by the off-diagonal mass of a head interaction matrix capturing weight and gradient coupling. Under mild smoothness assumptions, we prove that both excess hallucination probability and excess head redundancy scale with PoA, unifying two distinct failure modes into a single mechanism. The bound is prescriptive: regularization that reduces this off-diagonal mass provably tightens PoA. We instantiate this as GAME-LoRA, combining Barlow Twins…
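To make the "off-diagonal mass of a head interaction matrix" concrete, here is a minimal sketch. The abstract as shown here does not define the matrix, so this sketch assumes a stand-in: the cosine similarity between centered, flattened head outputs. The paper's matrix captures weight and gradient coupling, which this does not attempt to reproduce. The function names `head_interaction_matrix` and `off_diagonal_mass` are hypothetical.

```python
import numpy as np


def head_interaction_matrix(head_outputs: np.ndarray) -> np.ndarray:
    """Return an (H, H) similarity matrix between attention heads.

    ASSUMPTION: cosine similarity of centered head outputs is used as a
    stand-in for the paper's interaction matrix, which is not defined in
    the abstract.

    head_outputs has shape (num_heads, num_samples, dim).
    """
    num_heads = head_outputs.shape[0]
    # Flatten each head's outputs into one vector and center it.
    flat = head_outputs.reshape(num_heads, -1)
    flat = flat - flat.mean(axis=1, keepdims=True)
    # Normalize to unit length, guarding against all-zero heads.
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.maximum(norms, 1e-12)
    return unit @ unit.T


def off_diagonal_mass(matrix: np.ndarray) -> float:
    """Sum of absolute values of all off-diagonal entries."""
    off_diag = matrix - np.diag(np.diag(matrix))
    return float(np.abs(off_diag).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Four hypothetical heads, 32 samples, 16-dimensional outputs.
    outputs = rng.normal(size=(4, 32, 16))
    matrix = head_interaction_matrix(outputs)
    print("off-diagonal mass:", off_diagonal_mass(matrix))
```

Under this stand-in, two heads that produce identical outputs contribute the maximum possible mass for a pair, while heads with uncorrelated outputs contribute close to zero. A redundancy-reducing regularizer in the spirit the abstract describes would add a penalty that grows with this quantity.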
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Mobile Crowdsensing and Crowdsourcing · Advanced Graph Neural Networks
