Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity
Ernest Fokoué

TL;DR
This paper provides a rigorous statistical framework for understanding multi-head attention as an ensemble of Nadaraya-Watson estimators, highlighting the importance of head decorrelation and optimal head dimension allocation for variance reduction and model performance.
Contribution
It introduces a theoretical analysis of multi-head attention as an ensemble, defines the Head Diversity Index, and derives optimal head and dimension configurations based on data and budget constraints.
Findings
Variance reduction depends on head decorrelation, not just number of heads.
Orthogonal projections yield maximum variance reduction; aligned projections yield none.
Optimal head dimension grows logarithmically with training set size, while the number of heads grows nearly linearly with total budget.
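A textbook illustration of why decorrelation matters, not the paper's own decomposition: assume (hypothetically) that the H head outputs \hat f_h share a common variance \sigma^2 and a common pairwise correlation \rho. The variance of their average is then

```latex
\operatorname{Var}\!\left(\frac{1}{H}\sum_{h=1}^{H}\hat f_h\right)
  = \frac{\sigma^2}{H} + \frac{H-1}{H}\,\rho\,\sigma^2
```

With \rho = 0 the variance falls to \sigma^2 / H. With \rho = 1 it stays at \sigma^2 regardless of H, so adding heads alone does not help.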
Abstract
We develop a rigorous statistical theory of multi-head attention (MHA) as an ensemble of Nadaraya-Watson (NW) kernel regression estimators. Building on the algebraic identity between single-head softmax attention and the NW estimator, we prove that MHA is a structured ensemble of H NW estimators, each operating in a distinct learned projection subspace of the key space. We derive an explicit Bias-Variance-Covariance decomposition of the MHA mean squared error, showing that variance reduction depends not merely on the number of heads H but fundamentally on the decorrelation of head outputs. Decorrelation is governed by the principal angles between learned projection subspaces: orthogonal projections yield maximum variance reduction; aligned projections yield none. We introduce the Head Diversity Index (HDI), a computable spectral measure of inter-head decorrelation, and prove that MHA…
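The algebraic identity the abstract builds on can be checked numerically. The following is a minimal sketch, not the paper's code: it assumes a single query, unscaled dot-product scores, and the exponential kernel exp(<q, k_i>), and all function and variable names are hypothetical.

```python
import numpy as np

def softmax_attention(q, K, V):
    """Single-head softmax attention for one query (no 1/sqrt(d) scaling)."""
    scores = K @ q                     # (n,) dot-product scores
    w = np.exp(scores - scores.max())  # shift by the max for numerical stability
    w /= w.sum()                       # softmax weights sum to 1
    return w @ V                       # weighted average of the values

def nadaraya_watson(q, K, V):
    """NW estimator with the assumed kernel k(q, k_i) = exp(<q, k_i>)."""
    kern = np.exp(K @ q)               # (n,) kernel weights
    return (kern[:, None] * V).sum(axis=0) / kern.sum()

rng = np.random.default_rng(0)
n, d, d_v = 32, 8, 4                   # illustrative sizes, chosen arbitrarily
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d_v))

attn_out = softmax_attention(q, K, V)
nw_out = nadaraya_watson(q, K, V)
print(np.allclose(attn_out, nw_out))
```

The two functions differ only in the max-shift inside the softmax, which cancels in the ratio, so the outputs agree up to floating-point error.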
