Geometric Analysis of Token Selection in Multi-Head Attention

Timur Mudarisov; Mikhal Burtsev; Tatiana Petrova; Radu State

arXiv:2602.01893 · cs.AI · February 3, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a geometric framework for analyzing multi-head attention in large language models, providing theoretical bounds and empirical validation for token selection behavior.

Contribution

It offers a novel geometric perspective on attention, deriving bounds and metrics that quantify token separability and interpretability in LLMs.

Findings

01 Top-N selection enhances token separability.

02 Sink similarity influences recall performance.

03 LLaMA-2 heads exhibit three distinct regimes.

Abstract

We present a geometric framework for analysing multi-head attention in large language models (LLMs). Without altering the mechanism, we view standard attention through a top-N selection lens and study its behaviour directly in value-state space. We define geometric metrics (Precision, Recall, and F-score) to quantify separability between selected and non-selected tokens, and derive non-asymptotic bounds with explicit dependence on dimension and margin under empirically motivated assumptions (stable value norms with a compressed sink token, exponential similarity decay, and piecewise attention weight profiles). The theory predicts a small-N operating regime of strongest non-trivial separability and clarifies how sequence length and sink similarity shape the metrics. Empirically, across LLaMA-2-7B, Gemma-7B, and Mistral-7B, measurements closely track the theoretical envelopes: top-N…
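
The abstract's idea of reading attention as top-N selection and scoring separability in value-state space can be sketched in a few lines. This excerpt does not give the paper's exact definitions, so the sketch rests on an assumption: a ball of a chosen radius around the head output acts as the "classifier", and Precision, Recall, and F-score are computed against the top-N tokens by attention weight. All names here (`top_n_indices`, `separability`, `radius`) and the toy data are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_n_indices(weights, n):
    """Indices of the n tokens with the largest attention weights."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:n]


def head_output(weights, values):
    """Standard attention output: the attention-weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def separability(weights, values, n, radius):
    """Illustrative metrics (an assumption, not the paper's definitions).

    A token counts as 'predicted selected' when its value vector lies within
    `radius` of the head output; the top-n tokens are the 'true selected' set.
    """
    selected = set(top_n_indices(weights, n))
    center = head_output(weights, values)
    inside = {i for i, v in enumerate(values) if math.dist(v, center) <= radius}
    true_pos = len(selected & inside)
    precision = true_pos / len(inside) if inside else 0.0
    recall = true_pos / len(selected) if selected else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f_score


# Toy head: two tokens dominate the attention and have nearby value vectors.
weights = softmax([4.0, 3.5, 0.0, -1.0, -1.0])
values = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

print(separability(weights, values, n=2, radius=0.5))
print(separability(weights, values, n=2, radius=2.0))
```

In this toy setup a small radius isolates exactly the two top-weighted tokens, while a large radius also captures the non-selected tokens and lowers Precision. This only illustrates the trade-off such metrics are meant to capture; it does not reproduce the paper's bounds or measurements.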

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1) The geometric framing is clear and easy to follow, giving an intuitive picture of how attention works. 2) The theory lines up closely with the empirical data across different models. 3) The head taxonomy (Retriever, Mixer, Reset) offers a concrete and interpretable way to describe functional differences between heads. 4) The paper doesn’t rely on extra training or architectural changes, making the analysis generally applicable.

Weaknesses

(1) Limited practical connection. The paper gives a clean geometric description of token selection and supports it with strong empirical evidence. However, it stops short of connecting these findings to model performance or design improvements. The results are insightful but remain mainly diagnostic, without demonstrating benefits such as better alignment, loss reduction, or architectural efficiency. (2) Assumption sensitivity. The theoretical derivations rely on several empirical assumptions —

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The authors provide a novel view of multi-head attention with nice theoretical support. Using the geometric perspective, they interpret the head functions in multi-head attention. The findings on value norms (assumption 1) are interesting.

Weaknesses

My major concerns are about the validity of several assumptions in this paper and about what actionable suggestions it offers the LLM/attention community. 1. Regarding assumption 2, I am somewhat concerned about whether it makes sense to use an exponential function (valued between 0 and 1) to model a cosine similarity (ranging from -1 to 1). Although the authors show that the MAE is very small, I am wondering whether it makes sense to assume that the cosine similarity could be always non

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- The theoretical framing is novel and gives an interesting geometric perspective on attention, with clear analytical predictions that can be empirically checked.
- The observation that attention sinks are not mere no-ops, but instead play an active role (especially in Recall and Reset-type heads), is particularly interesting and could motivate deeper investigation into sink dynamics and normalization mechanisms in large models.

Weaknesses

- The paper is not particularly well presented. Key motivations are unclear: the authors do not sufficiently explain why attention should be modeled as a classifier or why geometric separability in value space is the right lens for interpretability. The classification framework feels somewhat imposed rather than naturally derived from prior literature or empirical necessity.
- Several important concepts (e.g. the “MAE” mentioned in the text) are never properly defined or justified in the context

Taxonomy

Topics: Neurobiology of Language and Bilingualism · Explainable Artificial Intelligence (XAI) · Ferroelectric and Negative Capacitance Devices