Norm$\times$Direction: Restoring the Missing Query Norm in Vision Linear Attention
Weikang Meng, Yadan Luo, Liangyu Huo, Yingjian Li, Yaowei Wang, Xin Li, Zheng Zhang

TL;DR
This paper introduces NaLaFormer, a novel linear attention mechanism that restores query norm information and preserves inner-product details, leading to state-of-the-art results across multiple vision and multimodal tasks with high efficiency.
Contribution
NaLaFormer employs a norm×direction decomposition to address expressiveness loss in linear attention, improving performance and memory efficiency in vision and multimodal applications.
Findings
Achieves up to 7.5% accuracy gain on ImageNet-1K
Improves mIoU by 4.7% on ADE20K
Reduces peak memory by 92.3% in super-resolution tasks
Abstract
Linear attention mitigates the quadratic complexity of softmax attention but suffers from a critical loss of expressiveness. We identify two primary causes: (1) The normalization operation cancels the query norm, which breaks the correlation between a query's norm and the spikiness (entropy) of the attention distribution as in softmax attention. (2) Standard techniques for enforcing non-negativity cause destructive information loss by nullifying valid inner-product interactions. To address these challenges, we introduce NaLaFormer, a novel linear attention mechanism built upon a norm×direction (ND) decomposition of the query and key vectors. We leverage each component to solve a distinct problem: The query norm is injected into our kernel to create a query-norm-aware map that restores the attention distribution's spikiness. The direction vectors are processed by a geometric,…
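The two causes named in the abstract can be checked numerically. The sketch below is not NaLaFormer's kernel: it assumes a plain ReLU feature map as a stand-in for a "standard" non-negative map, and the helper names (`softmax_weights`, `linear_weights`, `entropy`) are hypothetical. It rescales one query direction and compares the entropy of the resulting attention weights, then shows a positive inner product that ReLU maps to zero.

```python
import numpy as np

def relu_feature_map(x):
    # Assumed stand-in for a standard non-negative feature map (not the paper's kernel).
    return np.maximum(x, 0.0)

def softmax_weights(q, K):
    # Softmax attention weights of one query q over the keys in K (rows).
    logits = K @ q
    logits = logits - logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def linear_weights(q, K, phi=relu_feature_map):
    # Normalized linear attention weights: phi(q).phi(k_j) / sum_j phi(q).phi(k_j).
    scores = phi(K) @ phi(q)
    return scores / scores.sum()

def entropy(p):
    # Shannon entropy in nats, skipping exact zeros.
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))
# Use a direction with positive entries so the ReLU map of the query is never all-zero.
direction = np.abs(rng.normal(size=8))
direction /= np.linalg.norm(direction)

# Cause (1): scale the query norm while holding its direction fixed.
scales = (1.0, 4.0)
softmax_entropy = [entropy(softmax_weights(s * direction, K)) for s in scales]
linear_entropy = [entropy(linear_weights(s * direction, K)) for s in scales]

# Cause (2): two vectors with a positive inner product that ReLU wipes out.
q_neg = np.array([-1.0, -2.0])
k_neg = np.array([-3.0, -1.0])
true_inner = float(q_neg @ k_neg)  # 5.0
relu_inner = float(relu_feature_map(q_neg) @ relu_feature_map(k_neg))  # 0.0
```

Under these assumptions, the softmax entropy drops as the query norm grows, while the linear-attention entropy is unchanged because the scale factor cancels between numerator and denominator. The second part shows an inner product of 5.0 collapsing to 0.0 once both vectors pass through ReLU.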
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. To avoid negative values, the re-mapped cosine direction is a somewhat clever technical contribution. 2. The experiment results show strong improvements over many baselines. On ImageNet-1K, NaLaFormer variants outperform recent linear attention and some softmax-based ViT models across all model sizes, often by substantial margins. 3. Both vision (understanding and super-resolution) and language tasks are included, which extends the breadth of the applications of this paper.
1. The paper does not cite or compare to MetaLA [1] and InLine [2], which both focus on matching softmax’s spikiness or optimizing the linear approximation. Empirical comparisons and deeper methodological discussion are crucial, especially in language tasks, to establish both novelty and superiority. 2. The exact model definitions and training details are not fully disclosed in the appendix, such as Swish activation before the classifier, Layerscales, 1024-dim classifier, convolution patch embe
1. This paper resolves two core limitations of linear attention (query norm cancellation and destructive non-negativity enforcement) through a theoretically grounded ND decomposition, which is novel to me. 2. The experiments are conducted on competitive benchmarks, e.g., ImageNet, COCO, ADE20K and DIV2K. 3. The formulas and figures are clear and well-illustrated.
1. How the proposed linear attention works in diffusion models is not clear. Since existing work shows that linear attention can also perform well in diffusion transformers [1], the authors are encouraged to add some DiT experiments. 2. RoPE is used in the proposed method but is not discussed. 3. An ablation study on λ and τ is missing. [1] Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators, in ECCV 2024.
- This paper has a unique idea and proposes a novel angle; the writing is fluent and clear, and the proofs are complete. - The paper presents its own conjecture for why linear attention does not perform as well as vanilla attention, focusing on the negative correlation between the spikiness (entropy) of the attention matrix and the query norm. This is an interesting entry point, and the motivation is novel. - This paper gives an ingenious, effi
Overall, I found this article enlightening and do not see many problems, but I have one minor comment: I would suggest placing the q-norm/entropy correlation plots comparing linear and vanilla attention earlier in the paper, so that the reader can grasp the core idea more quickly.
Taxonomy
Topics: Neural Networks and Applications
Methods: Attentive Walk-Aggregating Graph Neural Network · Softmax
