Multi-Head Attention Is a Multi-Player Game

Kushal Chakrabarti; Nirmal Balachundar

arXiv:2602.00861 · cs.AI · February 3, 2026

Open Access

TL;DR

This paper models transformer attention heads as players in a game, analyzes how their interactions affect training efficiency, and proposes regularization methods to improve performance and reduce hallucinations.

Contribution

The paper formalizes multi-head attention as a potential game, derives bounds on the resulting inefficiency, and introduces GAME-LoRA, a regularization method that improves training outcomes by reducing head redundancy.

Findings

01 PoA bound predicts hallucination likelihood

02 Emergent coalitions show selective coordination

03 GAME-LoRA reduces hallucinations by up to 18%

Abstract

Modern transformer attention is internally multi-agent -- heads compete and coordinate -- yet we train it as if it were a monolithic optimizer. We formalize this gap: cross-entropy training induces an implicit potential game among heads, and gradient descent converges to Nash equilibria with potentially unbounded inefficiency due to unpriced externalities (redundancy, correlated errors). Our main result bounds the Price of Anarchy (PoA) by $\Gamma(G)$, the off-diagonal mass of a head interaction matrix capturing weight and gradient coupling. Under mild smoothness assumptions, we prove that both excess hallucination probability and excess head redundancy scale with PoA, unifying two distinct failure modes into a single mechanism. The bound is prescriptive: regularization that reduces $\Gamma(G)$ provably tightens PoA. We instantiate this as GAME-LoRA, combining Barlow Twins…
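To make the "off-diagonal mass" of a head interaction matrix concrete, the sketch below computes one plausible version of it on toy data. Everything here is an assumption for illustration: the abstract does not give the exact definition of $G$ or $\Gamma(G)$, so the code uses pairwise cosine similarity between hypothetical per-head vectors as a stand-in for $G$, and the sum of absolute off-diagonal entries as a stand-in for $\Gamma(G)$. The function names are hypothetical and are not taken from the paper.

```python
import math

def cosine_interaction_matrix(head_vectors):
    """Hypothetical stand-in for G: pairwise cosine similarity between per-head vectors.

    The paper's G captures weight and gradient coupling; this toy version does not.
    """
    norms = [math.sqrt(sum(x * x for x in v)) for v in head_vectors]
    n = len(head_vectors)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(head_vectors[i], head_vectors[j]))
            G[i][j] = dot / (norms[i] * norms[j])
    return G

def off_diagonal_mass(G):
    """Assumed form of Gamma(G): sum of |G[i][j]| over all i != j."""
    n = len(G)
    return sum(abs(G[i][j]) for i in range(n) for j in range(n) if i != j)

# Two of the three heads are identical, so they are fully redundant.
redundant = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# All three heads are mutually orthogonal, so there is no coupling.
orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

gamma_redundant = off_diagonal_mass(cosine_interaction_matrix(redundant))
gamma_orthogonal = off_diagonal_mass(cosine_interaction_matrix(orthogonal))
```

Under these assumptions the redundant configuration has strictly larger off-diagonal mass than the orthogonal one, which matches the abstract's intuition that reducing $\Gamma(G)$ corresponds to reducing coupling between heads.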

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Mobile Crowdsensing and Crowdsourcing · Advanced Graph Neural Networks