Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity

Ernest Fokoué

arXiv:2605.20271·stat.ML·May 21, 2026

TL;DR

This paper provides a rigorous statistical framework for understanding multi-head attention as an ensemble of Nadaraya-Watson estimators, highlighting the importance of head decorrelation and optimal head dimension allocation for variance reduction and model performance.

Contribution

The paper analyzes multi-head attention as an ensemble, defines the Head Diversity Index (HDI), and derives optimal head-count and head-dimension configurations under data and budget constraints.

Findings

01

Variance reduction depends on the decorrelation of head outputs, not just on the number of heads.

02

Orthogonal projections maximize variance reduction; aligned projections yield none.

03

The optimal head dimension grows logarithmically with training set size; the number of heads grows nearly linearly with the total budget.
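
The first two findings can be illustrated with the textbook variance formula for an average of H equally correlated estimators. The sketch below is a simplification and not the paper's Bias-Variance-Covariance decomposition: it assumes every head output has the same variance sigma2 and the same pairwise correlation rho, and the function names are hypothetical. Under those assumptions the variance of the average is sigma2 * (1/H + (1 - 1/H) * rho), so rho = 0 gives the full 1/H reduction and rho = 1 gives none.

```python
import numpy as np

def ensemble_variance(sigma2, rho, H):
    """Closed-form variance of the mean of H estimators that share
    variance sigma2 and pairwise correlation rho (an assumed,
    equicorrelated simplification)."""
    return sigma2 * (1.0 / H + (1.0 - 1.0 / H) * rho)

def simulated_variance(sigma2, rho, H, n_draws=200_000, seed=0):
    """Monte Carlo check: build H correlated 'head outputs' from one
    shared noise term and H private noise terms, then average them."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_draws, 1))    # component common to all heads
    private = rng.normal(size=(n_draws, H))   # component unique to each head
    heads = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private)
    return float(heads.mean(axis=1).var())

if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        print(rho, ensemble_variance(1.0, rho, 8), simulated_variance(1.0, rho, 8))
```

With fully correlated heads the closed form returns sigma2 for every H, which is the sense in which adding aligned heads buys no variance reduction in this toy model.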

Abstract

We develop a rigorous statistical theory of multi-head attention (MHA) as an ensemble of Nadaraya-Watson (NW) kernel regression estimators. Building on the algebraic identity between single-head softmax attention and the NW estimator, we prove that MHA is a structured ensemble of H NW estimators, each operating in a distinct learned projection subspace of the key space. We derive an explicit Bias-Variance-Covariance decomposition of the MHA mean squared error, showing that variance reduction depends not merely on the number of heads H but fundamentally on the decorrelation of head outputs. Decorrelation is governed by the principal angles between learned projection subspaces: orthogonal projections yield maximum variance reduction; aligned projections yield none. We introduce the Head Diversity Index (HDI), a computable spectral measure of inter-head decorrelation, and prove that MHA…
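
The algebraic identity the abstract builds on can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's construction: it uses a single query, scaled dot-product scores, and the exponential kernel exp(q·k / sqrt(d)); the function and variable names are hypothetical. Softmax attention and the Nadaraya-Watson weighted average then compute the same quantity.

```python
import numpy as np

def softmax_attention(q, K, V):
    """Single-head softmax attention output for one query vector q."""
    d = K.shape[1]
    scores = K @ q / np.sqrt(d)
    scores = scores - scores.max()       # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ V

def nadaraya_watson(q, K, V):
    """Nadaraya-Watson estimate at q with kernel exp(q.k / sqrt(d)):
    a kernel-weighted average of the values V."""
    d = K.shape[1]
    kern = np.exp(K @ q / np.sqrt(d))
    return (kern[:, None] * V).sum(axis=0) / kern.sum()

rng = np.random.default_rng(0)
n, d, dv = 50, 8, 4                      # keys, key dimension, value dimension
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, dv))
q = rng.normal(size=d)

attn_out = softmax_attention(q, K, V)
nw_out = nadaraya_watson(q, K, V)
max_gap = float(np.max(np.abs(attn_out - nw_out)))  # agreement up to rounding
```

The two functions differ only in the stability shift applied before exponentiation, which cancels in the normalization.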
