Interleaved Head Attention

Sai Surya Duvvuri; Chanakya Ekbote; Rachit Bansal; Rishabh Tiwari; Devvrit Khatri; David Brandfonbrener; Paul Liang; Inderjit Dhillon; Manzil Zaheer

arXiv:2602.21371·cs.LG·February 26, 2026

TL;DR

Interleaved Head Attention (IHA) adds cross-head communication to multi-head attention, improving reasoning and parameter efficiency in large language models, with reported gains on both synthetic tasks and real-world benchmarks.

Contribution

The paper introduces Interleaved Head Attention, a novel mechanism allowing cross-head mixing in multi-head attention to improve reasoning and parameter efficiency.

Findings

01 IHA outperforms standard MHA on synthetic polynomial and order-sensitive tasks.

02 IHA improves retrieval accuracy on RULER by 10-20%.

03 IHA enhances reasoning performance on GSM8K and MATH-500 datasets.

Abstract

Multi-Head Attention (MHA) is the core computational primitive underlying modern Large Language Models (LLMs). However, MHA suffers from a fundamental linear scaling limitation: $H$ attention heads produce exactly $H$ independent attention matrices, with no communication between heads during attention computation. This becomes problematic for multi-step reasoning, where correct answers depend on aggregating evidence from multiple parts of the context and composing latent token-to-token relations over a chain of intermediate inferences. To address this, we propose Interleaved Head Attention (IHA), which enables cross-head mixing by constructing $P$ pseudo-heads per head (typically $P = H$), where each pseudo query/key/value is a learned linear combination of all $H$ original queries, keys, and values, respectively. Interactions between pseudo-query and pseudo-key heads induce up to $P^{2}$ …
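The pseudo-head construction described in the abstract can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `make_pseudo_heads`, the weight tensors `alpha_q` and `alpha_k`, and their `(H, P, H)` layout are hypothetical, and random values stand in for learned weights. The sketch stops at the pairwise pseudo-query/pseudo-key score matrices, because the excerpt above does not specify how they are normalized, interleaved, or combined with values.

```python
import numpy as np


def make_pseudo_heads(x, mix):
    """Mix H per-head tensors into P pseudo-heads per head.

    x:   (H, N, d)  original per-head queries, keys, or values
    mix: (H, P, H)  hypothetical mixing weights; mix[h, p, j] is the weight
                    of original head j in pseudo-head p of head h
    returns: (H, P, N, d)
    """
    return np.einsum("hpj,jnd->hpnd", mix, x)


rng = np.random.default_rng(0)
H, N, d = 4, 6, 8   # heads, sequence length, head dimension (toy sizes)
P = H               # the abstract notes that typically P = H

# Per-head queries and keys, as standard MHA would produce them.
q = rng.standard_normal((H, N, d))
k = rng.standard_normal((H, N, d))

# Random stand-ins for the learned linear-combination weights (assumption).
alpha_q = rng.standard_normal((H, P, H))
alpha_k = rng.standard_normal((H, P, H))

pq = make_pseudo_heads(q, alpha_q)   # (H, P, N, d)
pk = make_pseudo_heads(k, alpha_k)   # (H, P, N, d)

# Every pseudo-query p can be scored against every pseudo-key r of the same
# head, giving P * P score matrices of shape (N, N) per head.
scores = np.einsum("hpnd,hrmd->hprnm", pq, pk) / np.sqrt(d)
print(scores.shape)  # (4, 4, 4, 6, 6)
```

With one-hot mixing weights and $P = 1$, `make_pseudo_heads` returns the original heads unchanged, so in this sketch standard MHA is the special case with no cross-head mixing.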

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques