Explicit Multi-head Attention for Inter-head Interaction in Large Language Models
Runyu Peng, Yunhua Zhou, Demin Song, Kai Lv, Bo Wang, Qipeng Guo, Xipeng Qiu

TL;DR
This paper introduces Multi-head Explicit Attention (MEA), a novel attention variant that explicitly models cross-head interactions, improving robustness, efficiency, and performance in large language models.
Contribution
The paper proposes MEA with HLC and normalization layers, enabling rich inter-head communication, faster convergence, and efficient KV-cache compression in large language models.
Findings
MEA improves attention performance and robustness.
Faster convergence with larger learning rates.
50% reduction in KV-cache memory with minimal performance loss.
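To put the 50% figure in perspective, the following back-of-the-envelope sketch computes KV-cache size for a hypothetical model. The configuration (32 layers, 32 key/value heads, head dimension 128, 16-bit cache entries, 32,768-token context) is an assumption chosen for illustration, not a setting from the paper. The sketch also assumes the saving comes from storing half as many key/value heads, which this summary does not specify.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Size of a standard KV cache in bytes.

    The leading factor 2 accounts for one key tensor and one value
    tensor per layer. All arguments are illustrative assumptions.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical configuration, not taken from the paper.
full = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=32768)
# Same model with half as many stored key/value heads.
half = kv_cache_bytes(n_layers=32, n_kv_heads=16, head_dim=128, seq_len=32768)

print(full / 2**30, half / 2**30)  # 16.0 8.0 (GiB)
```

Under these assumptions the cache shrinks from 16 GiB to 8 GiB per sequence, and the saving grows linearly with context length and batch size.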
Abstract
In large language models built upon the Transformer architecture, recent studies have shown that inter-head interaction can enhance attention performance. Motivated by this, we propose Multi-head Explicit Attention (MEA), a simple yet effective attention variant that explicitly models cross-head interaction. MEA consists of two key components: a Head-level Linear Composition (HLC) module that separately applies learnable linear combinations to the key and value vectors across heads, thereby enabling rich inter-head communication; and a head-level Group Normalization layer that aligns the statistical properties of the recombined heads. MEA shows strong robustness in pretraining, which allows the use of larger learning rates that lead to faster convergence, ultimately resulting in lower validation loss and improved performance across a range of tasks. Furthermore, we explore the parameter…
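The two components described in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mea_attention`, the mixing-matrix shapes, the causal mask, and the parameter-free normalization (one group per head, no learnable scale or bias) are all hypothetical choices made to keep the example short.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mea_attention(q, k, v, w_k, w_v, eps=1e-6):
    """Sketch of head-mixed attention followed by per-head normalization.

    q:        (H, T, d)     queries for H heads
    k, v:     (H_kv, T, d)  stored keys and values
    w_k, w_v: (H, H_kv)     mixing weights (assumed shape), kept separate
                            for keys and values
    """
    # HLC step: each output head is a linear combination of the stored heads.
    k_mix = np.einsum("ij,jtd->itd", w_k, k)
    v_mix = np.einsum("ij,jtd->itd", w_v, v)

    # Standard scaled dot-product attention on the recombined heads.
    T, d = q.shape[1], q.shape[2]
    scores = q @ k_mix.transpose(0, 2, 1) / np.sqrt(d)      # (H, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)      # causal mask (assumption)
    scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v_mix                           # (H, T, d)

    # Head-level normalization: statistics computed per head and per token
    # over the feature dimension, i.e. group norm with one group per head.
    mean = out.mean(axis=-1, keepdims=True)
    var = out.var(axis=-1, keepdims=True)
    return (out - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
H, T, d = 4, 5, 8
q = rng.standard_normal((H, T, d))
k = rng.standard_normal((H, T, d))
v = rng.standard_normal((H, T, d))
# Near-identity mixing: identity alone would leave the heads unmixed.
w_k = np.eye(H) + 0.1 * rng.standard_normal((H, H))
w_v = np.eye(H) + 0.1 * rng.standard_normal((H, H))
out = mea_attention(q, k, v, w_k, w_v)

# A single shared key/value head broadcast to all query heads (MQA-like case).
out_shared = mea_attention(q, k[:1], v[:1], np.ones((H, 1)), np.ones((H, 1)))
```

In this sketch, setting `w_k` and `w_v` to the identity leaves the heads unmixed, so only the normalization differs from ordinary multi-head attention, while a single stored head with all-ones mixing weights behaves like multi-query attention.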
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
- Addresses an important bottleneck in LLMs: attention over long contexts.
- The formulation, which generalizes prior inter-head interaction methods such as Differential Transformer and Talking-Heads Attention, is interesting.
- Although the KV-cache compression targets a long-context bottleneck, the paper contains no long-context evaluation.
- The full-parameter CPT setup reduces the performance of the math benchmark baseline. This raises the question of whether math is more sensitive to compression or whether the CPT dataset needs more math data. It is hard to tell the reason from the data.
- There is no comparison with other KV-cache compression methods.
- There is no validation for the learning
- Clear background, motivation, and formulation, generalizing different types of attention: MQA and MHA, DFA and THA.
- Provides intuition for when and why each type of attention fails or degenerates into MHA.
- Evaluated on a wide range of challenging math and reasoning benchmarks.
- The paper did not compare with other approaches that save on KV cache by continuous pretraining.
- It wasn't clear to me why the proposed approach is more robust to KV compression than other approaches.
1. The most significant contribution of this work is the KV-cache compression strategy. A 50% reduction in KV-cache memory is a highly valuable engineering result, directly addressing one of the primary bottlenecks in long-context LLM inference. The fact that this is achieved with "negligible performance loss" on knowledge and science benchmarks (e.g., <1.5% drop on average) is very compelling.
2. The paper provides a clear and convincing argument for why MEA works. The authors demonstrate tha
1. The idea of inter-head interaction is not new, as the authors acknowledge by citing Talking-Heads Attention and Differential Transformer. The HLC module is a specific form of linear combination, and the GroupNorm component is directly inspired by DFA. The primary innovation is the specific combination of these ideas (pre-attention K/V mixing + post-attention GroupNorm) and the analysis of why this specific combination avoids the degeneration that plagued prior work.
2. While the training dyna
- Presents useful efforts to unify existing formulations, notably expressing variants of the Differential Transformer and Talking-Heads Attention as special cases of MEA. Specifically, the work offers a new perspective on a DFA variant result, connecting insights across related work.
- Results show improved pretraining convergence and accuracy gains on some standard benchmark datasets.
The key premise of the work is that inter-head interaction can enhance attention performance but the experimental results suggest that group normalization may play a bigger role than the mere inter-head communication. Since there are multiple somewhat orthogonal explorations going on, namely an assessment of MEA performance, an exploration of SVD-based efficiency improvements, as well as cost-efficient hyperparameter selection via scaling laws, the main empirical results do not seem to offer muc
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Machine Learning in Healthcare
