Norm$\times$Direction: Restoring the Missing Query Norm in Vision Linear Attention

Weikang Meng; Yadan Luo; Liangyu Huo; Yingjian Li; Yaowei Wang; Xin Li; Zheng Zhang

arXiv:2506.21137·cs.LG·February 5, 2026



Open Access · 3 Reviews

TL;DR

This paper introduces NaLaFormer, a novel linear attention mechanism that restores query norm information and preserves inner-product details, leading to state-of-the-art results across multiple vision and multimodal tasks with high efficiency.

Contribution

NaLaFormer employs a norm×direction decomposition to address expressiveness loss in linear attention, improving performance and memory efficiency in vision and multimodal applications.

Findings

01. Achieves up to 7.5% accuracy gain on ImageNet-1K

02. Improves mIoU by 4.7% on ADE20K

03. Reduces peak memory by 92.3% in super-resolution tasks

Abstract

Linear attention mitigates the quadratic complexity of softmax attention but suffers from a critical loss of expressiveness. We identify two primary causes: (1) The normalization operation cancels the query norm, which breaks the correlation between a query's norm and the spikiness (entropy) of the attention distribution as in softmax attention. (2) Standard techniques for enforcing non-negativity cause destructive information loss by nullifying valid inner-product interactions. To address these challenges, we introduce NaLaFormer, a novel linear attention mechanism built upon a norm $\times$ direction (ND) decomposition of the query and key vectors. We leverage each component to solve a distinct problem: The query norm is injected into our kernel to create a query-norm-aware map that restores the attention distribution's spikiness. The direction vectors are processed by a geometric,…
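The two causes named in the abstract can be checked numerically. The following is a minimal sketch, not the NaLaFormer kernel: it assumes a plain ReLU feature map as the non-negativity step (a common choice in linear attention, used here only for illustration), and the helper names `softmax_weights` and `linear_weights` are hypothetical. It shows that scaling a query sharpens softmax attention but leaves normalized linear attention weights unchanged, and that ReLU can zero out a valid positive inner product.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relu(v):
    # Assumed feature map: element-wise ReLU to enforce non-negativity.
    return [max(x, 0.0) for x in v]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def softmax_weights(q, keys):
    # Scaled dot-product scores followed by a numerically stable softmax.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * dot(q, k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_weights(q, keys, phi=relu):
    # Normalized linear attention weights: phi(q).phi(k_j) / sum_i phi(q).phi(k_i).
    sims = [dot(phi(q), phi(k)) for k in keys]
    total = sum(sims)
    if total == 0.0:
        # Degenerate case: fall back to uniform weights.
        return [1.0 / len(keys)] * len(keys)
    return [s / total for s in sims]


rng = random.Random(0)
dim, n_keys = 8, 16
keys = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_keys)]
q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
q_scaled = [4.0 * x for x in q]  # same direction, 4x the norm

# (1) Softmax: a larger query norm acts like a lower temperature, so entropy drops.
h_soft = entropy(softmax_weights(q, keys))
h_soft_scaled = entropy(softmax_weights(q_scaled, keys))

# Linear attention: ReLU is positively homogeneous, so the factor 4 appears in
# both numerator and denominator and cancels.
w_lin = linear_weights(q, keys)
w_lin_scaled = linear_weights(q_scaled, keys)
max_diff = max(abs(a - b) for a, b in zip(w_lin, w_lin_scaled))

# (2) Non-negativity via ReLU can nullify a valid positive inner product.
q_neg, k_neg = [-1.0, -2.0], [-3.0, -1.0]
raw_sim = dot(q_neg, k_neg)               # 5.0
relu_sim = dot(relu(q_neg), relu(k_neg))  # 0.0

print(f"softmax entropy: {h_soft:.4f} -> {h_soft_scaled:.4f}")
print(f"max change in linear attention weights: {max_diff:.2e}")
print(f"raw inner product: {raw_sim}, after ReLU: {relu_sim}")
```

The cancellation in (1) holds for any positively homogeneous feature map, which is why the sketch uses ReLU; a feature map that depends on the query norm in a non-homogeneous way would not cancel it.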

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The re-mapped cosine direction, used to avoid negative values, is a somewhat clever technical contribution.
2. The experimental results show strong improvements over many baselines. On ImageNet-1K, NaLaFormer variants outperform recent linear attention and some softmax-based ViT models across all model sizes, often by substantial margins.
3. Both vision (understanding and super-resolution) and language tasks are included, which broadens the range of applications covered by the paper.

Weaknesses

1. The paper does not cite or compare to MetaLA [1] and InLine [2], which both focus on matching softmax’s spikiness or optimizing the linear approximation. Empirical comparisons and deeper methodological discussion are crucial, especially in language tasks, to establish both novelty and superiority.
2. The exact model definitions and training details are not fully disclosed in the appendix, such as Swish activation before the classifier, Layerscales, 1024-dim classifier, convolution patch embe

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. This paper resolves two core limitations of linear attention (query norm cancellation and destructive non-negativity enforcement) through a theoretically grounded ND decomposition, which is novel to me.
2. The experiments are conducted on competitive benchmarks, e.g., ImageNet, COCO, ADE20K and DIV2K.
3. The formulas and figures are clear and well illustrated.

Weaknesses

1. How the proposed linear attention works in diffusion models is not clear. Since existing work shows that linear attention can also perform well in diffusion transformers [1], the authors are encouraged to add some DiT experiments.
2. RoPE is used in the proposed method but is not discussed.
3. An ablation study on λ and τ is missing.

[1] Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators, in ECCV 2024.

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- This paper has a unique idea and proposes a novel angle; the line of reasoning is fluent and clear, and the proofs are complete.
- The paper presents its own conjecture for why linear attention does not perform as well as vanilla attention, focusing on the negative correlation between the spikiness (entropy) of the attention matrix and the query norm. This is an interesting entry point, and the motivation is novel.
- This paper gives an ingenious, effi

Weaknesses

Overall, I found this article enlightening and do not see many problems. I have one minor suggestion: the q-norm/entropy correlation plots comparing linear and vanilla attention should appear earlier in the paper, so that the reader can grasp the core idea more quickly.


Taxonomy

Topics: Neural Networks and Applications

Methods: Attentive Walk-Aggregating Graph Neural Network · Softmax