More Expressive Attention with Negative Weights
Ang Lv, Ruobing Xie, Shuaipeng Li, Jiayi Liao, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan

TL;DR
This paper introduces Cog Attention, a novel attention mechanism allowing negative weights, which increases expressiveness and robustness, leading to improved performance in language and image generation models.
Contribution
The paper presents Cog Attention, a new attention method that enables negative weights and enhances model flexibility and robustness, surpassing traditional softmax attention.
Findings
Models with Cog Attention outperform traditional softmax attention models.
Cog Attention improves model robustness against representational collapse.
It enables multiple operations within a single attention head.
Abstract
We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness, which stems from two key factors: (1) Cog Attention enhances parameter flexibility. For example, unlike traditional softmax attention heads that use a static output-value (OV) matrix to delete or copy inputs that the heads attend to, Cog Attention naturally learns to use the sign of dynamic query-key (QK) inner products to represent these operations. This enables Cog Attention to perform multiple operations simultaneously within a single head. Meanwhile, Cog Attention's OV matrix can focus more on refinement or modification. (2) Cog Attention enhances the model's robustness against representational collapse by preventing the "over-squashing" of earlier tokens into later positions. We develop Transformer-like models which use Cog Attention as attention…
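To make the idea of signed attention weights concrete, here is a minimal sketch in plain Python. The abstract does not state the exact normalization used by Cog Attention, so the `signed_softmax` function below is an illustrative assumption (softmax over the magnitudes of the QK scores, with the sign restored afterwards), not the authors' definition. The names `signed_softmax` and `attend`, and the toy query, keys, and values, are hypothetical.

```python
import math


def signed_softmax(scores):
    """Assumed normalization: sign(s_i) * exp(|s_i|) / sum_j exp(|s_j|).

    This is a sketch, not necessarily the formula used in the paper.
    """
    mags = [abs(s) for s in scores]
    m = max(mags)  # subtract the largest magnitude for numerical stability
    exps = [math.exp(a - m) for a in mags]
    total = sum(exps)  # the largest term is exp(0) = 1, so total >= 1
    signs = [(s > 0) - (s < 0) for s in scores]  # -1, 0, or +1
    return [sg * e / total for sg, e in zip(signs, exps)]


def attend(query, keys, values):
    """Single-query attention using the assumed signed normalization."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = signed_softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, out


# Toy example: the first key aligns with the query, the second opposes it.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [-2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

weights, out = attend(query, keys, values)
print(weights)  # [0.5, -0.5]
print(out)      # [0.5, -0.5]
```

In this toy case the same head adds the first token's value and subtracts the second's, which is one way to read the abstract's claim that the sign of the QK inner product can represent copy and delete operations within a single head. Because the denominator sums exponentials of magnitudes, it stays at or above 1 here, which sidesteps a division by zero under this assumed normalization.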
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The research direction is interesting and the proposed method is easy to implement without overhead. Cogformer outperforms several Transformer variants on standard benchmarks.
The work seems incomplete, and the figures on pages 13-14 are not referenced in the text. The introduction is hard to follow. The statements "while the output-value (OV) matrix governs the processing of these attended tokens" (lines 37-39), "…introducing negative weights can lead to challenges such as training instability, numerical overflow, and difficulties in attention normalization due to issues like division by zero." (lines 46-48) and (Irrelevant tokens are assigned negative weights for elimination, whi
**Mechanistic Interpretation**: Provides compelling, experiment-backed explanations for how and why negative weights improve expressiveness and robustness, moving beyond mere performance claims. **Practical Efficiency**: Despite the conceptual shift, the method is engineered to be as computationally efficient as standard attention, which is crucial for adoption. **Parameter-Free**: Cog Attention does not introduce new learnable parameters or require delicate hyperparameter tuning, making it ea
**Scope of Evaluation**: The evaluation is primarily on standard NLP benchmarks. Testing on more challenging domains, such as long-context reasoning, code generation, or multilingual tasks, would strengthen the claims about generalizability and mitigate collapse. **Qualitative Analysis of Negative Weights**: The mechanistic interpretation is focused on a few heads in a specific task. A more systematic analysis of the distribution and functional roles of negative weights throughout the network i
- The idea of introducing negative attention weights is novel and underexplored. - The paper provides mechanistic interpretations to support the claimed benefits. - Cog Attention does not introduce additional parameters or hyperparameters, which is a practical advantage. - The models show consistent improvements across multiple tasks and scales.
1. **Limited Experimental Scope** While the results on standard NLP benchmarks are promising, the evaluation could be strengthened by including more challenging settings such as few-shot learning, retrieval-augmented generation, long-context reasoning, or complex logical reasoning tasks. These would better demonstrate the generalizability and practical utility of Cog Attention. 2. **Insufficient Baseline Comparisons** The current baselines (Differential Attention and Centered Attentio
* The idea of allowing negative attention weights is simple and interesting. It has the potential to tackle robustness issues of standard attention.
* The models seem to be undertrained. The performance on SST-2, a binary classification task, is barely above chance. Majority-class/random baseline performance should be reported for all datasets. * The analysis of the Cog expressiveness in Section 3 is somewhat limited. You should report metrics such as attention concentration, head diversity, sink and local focus [1,2], on the language modeling tasks beyond the toy tasks that you currently include. This analysis will provide a more rigorous an
1. Novel: Efforts to achieve non-negative, non-normalized attention are meaningful. They also help us better understand the standard softmax attention mechanism itself. 2. The experiments effectively support the motivation.
1. Lack of baseline and ablation. I think at least two additional ablation experiments are needed, see question 3. 2. Some things lack evidence to support them. E.g. "This method is driven by our observation that an effective attention pattern for convergence must demonstrate sufficient kurtosis—that is, it should be sparse and sharp enough." Actually, you can refer to this literature [1]. 3. Potential negative impacts not discussed. [1] The Devil in Linear Transformer. https://arxiv.org/
Taxonomy
Topics: Cognitive Science and Education Research · Computability, Logic, AI Algorithms · Creativity in Education and Neuroscience
Methods: Attention Is All You Need · Focus · Diffusion · Softmax
