MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head
Kewei Zhang, Ye Huang, Yufan Deng, Jincheng Yu, Junsong Chen, Huan Ling, Enze Xie, Daquan Zhou

TL;DR
This paper introduces MHLA, a multi-head linear attention mechanism that preserves diversity and expressivity in linear attention models, significantly improving performance across vision and language tasks without increasing computational complexity.
Contribution
The paper proposes MHLA, a novel multi-head linear attention method that maintains linear complexity and enhances expressivity by preventing global context collapse.
Findings
3.6% improvement on ImageNet classification
6.3% gain on NLP tasks
12.6% improvement on image generation
Abstract
While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, with existing fixes typically re-introducing computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a…
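The idea in the abstract, attention computed over heads that partition the token dimension, can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes an ELU+1 feature map, equal-sized contiguous token blocks, and a nonnegative mixing matrix `w` over per-block key-value summaries; all function names are hypothetical. With `w` set to all ones, the sketch reduces to ordinary global linear attention, whose implied attention matrix has rank at most `d`.

```python
import numpy as np

def phi(x):
    # Positive feature map; ELU(x) + 1 is assumed here, not taken from the paper.
    return np.where(x > 0, x + 1.0, np.exp(x))

def global_linear_attention(q, k, v):
    # One shared (d, d_v) summary for all queries: cost linear in sequence length.
    fq, fk = phi(q), phi(k)
    s = fk.T @ v                      # (d, d_v) key-value summary
    z = fk.sum(axis=0)                # (d,) normalizer summary
    return (fq @ s) / (fq @ z)[:, None]

def blockwise_mixed_attention(q, k, v, w):
    # Hypothetical token-level multi-head variant: split tokens into m blocks,
    # build one summary per block, then give each query block its own
    # nonnegative mixture of those summaries (w has shape (m, m)).
    m = w.shape[0]
    n, d = q.shape
    assert n % m == 0, "sketch assumes equal-sized blocks"
    fq = phi(q).reshape(m, n // m, d)
    fk = phi(k).reshape(m, n // m, d)
    vb = v.reshape(m, n // m, -1)
    s = np.einsum("bnd,bne->bde", fk, vb)    # (m, d, d_v) per-block summaries
    z = fk.sum(axis=1)                       # (m, d) per-block normalizers
    s_mix = np.einsum("ib,bde->ide", w, s)   # mixture seen by query block i
    z_mix = w @ z
    num = np.einsum("ind,ide->ine", fq, s_mix)
    den = np.einsum("ind,id->in", fq, z_mix)
    return (num / den[..., None]).reshape(n, -1)

def implied_attention(q, k, w):
    # Dense (n, n) attention matrix implied by the mixing; for analysis only.
    m, n = w.shape[0], q.shape[0]
    block = n // m
    scores = (phi(q) @ phi(k).T) * np.kron(w, np.ones((block, block)))
    return scores / scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, d_v, m = 32, 4, 6, 4
q, k = rng.standard_normal((n, d)), rng.standard_normal((n, d))
v = rng.standard_normal((n, d_v))
w = rng.uniform(0.1, 1.0, size=(m, m))
print("rank, global:", np.linalg.matrix_rank(implied_attention(q, k, np.ones((m, m)))))
print("rank, mixed: ", np.linalg.matrix_rank(implied_attention(q, k, w)))
```

In this sketch the per-block summaries cost O(N d d_v) to build and the mixing costs O(m² d d_v), so the total stays linear in the number of tokens N for a fixed number of blocks m. The dense `implied_attention` matrix is built only to inspect rank and is not needed to compute the output.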
Peer Reviews
Decision: ICLR 2026 Poster
1. The diagnosis of "global context collapse," supported by a concise analysis of rank deficiency and entropy elevation, provides intuitive motivation for the work. It clearly articulates why previous linear attention models often underperform.
2. The paper presents extensive experiments across multiple domains (computer vision and NLP).
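The rank-deficiency argument mentioned in this review can be sketched as follows. The block assumes the standard kernelized form of linear attention, with a positive feature map \phi that keeps the feature dimension d; the notation is illustrative and not copied from the paper.

```latex
% Assumed setup: Q, K \in \mathbb{R}^{N \times d}, V \in \mathbb{R}^{N \times d_v},
% \phi applied row-wise, \mathbf{1} the all-ones vector of length N.
\begin{align}
O &= D^{-1}\,\phi(Q)\,\bigl(\phi(K)^{\top} V\bigr),
  \qquad D = \operatorname{diag}\bigl(\phi(Q)\,\phi(K)^{\top}\mathbf{1}\bigr), \\
A &= D^{-1}\,\phi(Q)\,\phi(K)^{\top} \in \mathbb{R}^{N \times N}, \\
\operatorname{rank}(A) &\le \min\bigl(\operatorname{rank}\phi(Q),\ \operatorname{rank}\phi(K)\bigr) \le d .
\end{align}
```

Since D is an invertible diagonal scaling, it does not change the rank, so for N much larger than d the implied attention matrix A is far from full rank: every query reads from the same d × d_v summary \phi(K)^{\top} V.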
1. The choice of the term "Multi-Head" is confusing and conflicts with the well-established definition from the original Transformer, which refers to splitting the channel dimension. In this paper, "heads" are defined along the token/spatial dimension. This non-standard usage could lead to significant confusion in the community.
2. The paper misses some important comparisons with highly relevant prior work such as FLASH[1] and VOLO[2], which also employ a block-wise strategy combining quadratic an
- Clear diagnosis of “global context collapse” with complementary theoretical indicators, e.g., a rank upper bound ≤ d for global linear attention, MHLA’s additive blockwise rank potential, and empirical entropy analyses.
- Simple, hardware-friendly construction: blockwise summaries + learned nonnegative mixing; retains linear-time leading term and is compatible with chunkwise/streaming execution.
- Strong cross-domain results: (i) ImageNet improvements over linear attention baselines and com
- Results are mostly single-number comparisons; there is no reporting of multiple seeds, confidence intervals, or significance tests. For diffusion FID/IS/sFID and ImageNet top-1, please consider reporting mean ± std over ≥3 seeds and specifying the sample counts and evaluation protocols used for FID/IS (e.g., 50k samples, classifier, resize method) to support the claims of consistent improvements.
- Language modeling evaluation is narrow. The 0.3–0.34B model trained on 5B tokens is assessed on a small s
1. Insightful analysis of linear attention: The authors conduct a rigorous examination of why linear attention underperforms, identifying global context collapse as the root cause. Their analysis of rank deficiency and entropy in attention maps provides a strong theoretical foundation.
2. Innovative multi-head design: MHLA introduces a token-level multi-head mechanism that mixes local key–value summaries through learnable coefficients. This design cleverly restores query-dependent diversity a
### 1. Limited and Inconclusive Experimental Evaluation
The major limitation of this paper lies in its lack of comprehensive experiments to support its central claim: improving long-term modeling capability. In the vision domain, experiments are restricted to image classification on 224×224 inputs, which yield sequences too short to reflect long-range dependencies. In the NLP domain, the evaluation covers only three small-scale reasoning benchmarks (ARC-c, WinoGrande, COPA), which do not assess long
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
