Differential Gated Self-Attention
Elpiniki Maria Lygizou, Mónika Farsang, Radu Grosu

TL;DR
This paper introduces Multihead Differential Gated Self-Attention (M-DGSA), a biologically inspired mechanism that enhances Transformer robustness to noisy inputs by dynamically gating attention through excitatory and inhibitory branches, with minimal computational cost.
Contribution
It proposes a novel input-dependent gating mechanism for self-attention inspired by biological lateral inhibition, integrating contrast enhancement into Transformer models.
Findings
Demonstrates improved robustness on vision and language benchmarks.
Achieves consistent performance gains over baseline Transformers.
Shows effective noise suppression with minimal computational overhead.
Abstract
Transformers excel across a large variety of tasks but remain susceptible to corrupted inputs, since standard self-attention treats all query-key interactions uniformly. Inspired by lateral inhibition in biological neural circuits and building on the Differential Transformer's recent use of the subtraction of two parallel softmax maps for noise cancellation, we propose Multihead Differential Gated Self-Attention (M-DGSA), which learns per-head input-dependent gating to dynamically suppress attention noise. Each head splits into excitatory and inhibitory branches whose dual softmax maps are fused by a sigmoid gate predicted from the token embedding, yielding a context-aware contrast enhancement. M-DGSA integrates seamlessly into existing Transformer stacks with minimal computational overhead. We evaluate on both vision and language benchmarks, demonstrating consistent robustness gains over…
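The abstract does not give the exact fusion rule, so the following is only a minimal single-head sketch under stated assumptions. It assumes the gate takes the place of the Differential Transformer's scalar $\lambda$, so that the fused map is `a_exc - gate * a_inh`, with one gate value per query token computed from that token's embedding. The function names (`dgsa_head`, `init_params`), the parameter names, and the fusion formula itself are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(rng, d_model, d_head):
    # Hypothetical parameter layout: separate query/key projections for the
    # excitatory and inhibitory branches, a shared value projection, and a
    # linear gate that maps each token embedding to one scalar.
    def mat(rows, cols):
        return rng.normal(scale=rows ** -0.5, size=(rows, cols))

    return {
        "W_q_exc": mat(d_model, d_head), "W_k_exc": mat(d_model, d_head),
        "W_q_inh": mat(d_model, d_head), "W_k_inh": mat(d_model, d_head),
        "W_v": mat(d_model, d_head),
        "w_gate": mat(d_model, 1), "b_gate": np.zeros(1),
    }


def dgsa_head(x, params):
    """One differential gated attention head (assumed form).

    x: (n_tokens, d_model) token embeddings.
    Returns (output, fused attention map, gate values).
    """
    d_head = params["W_q_exc"].shape[1]
    scale = 1.0 / np.sqrt(d_head)

    # Excitatory and inhibitory branches each produce their own softmax map.
    a_exc = softmax((x @ params["W_q_exc"]) @ (x @ params["W_k_exc"]).T * scale)
    a_inh = softmax((x @ params["W_q_inh"]) @ (x @ params["W_k_inh"]).T * scale)

    # Input-dependent gate in (0, 1), one value per query token: shape (n, 1).
    gate = sigmoid(x @ params["w_gate"] + params["b_gate"])

    # ASSUMPTION: the gate scales the inhibitory map before subtraction,
    # mirroring the Differential Transformer's "A1 - lambda * A2".
    attn = a_exc - gate * a_inh

    return attn @ (x @ params["W_v"]), attn, gate


rng = np.random.default_rng(0)
params = init_params(rng, d_model=8, d_head=4)
x = rng.normal(size=(5, 8))
out, attn, gate = dgsa_head(x, params)
print(out.shape, attn.shape, gate.shape)
```

Under this assumed form, each row of the fused map sums to `1 - gate` for that query, so a gate near 1 removes most of the attention mass shared by both branches, while a gate near 0 leaves the excitatory map essentially unchanged.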
Peer Reviews
Decision · Submitted to ICLR 2026
+ The motivation of the proposed method is reasonable, introducing an interpretable gating mechanism based on lateral inhibition.
+ M-DGSA shows improved accuracy and noise resilience across several vision and language tasks, outperforming baselines. It also produces sharper, more focused attention maps.
+ M-DGSA can be incorporated into existing Transformer architectures with negligible computational or memory cost.
- The gating mechanism, while lightweight, adds more complexity to the attention computation and may require careful tuning. It is not clear how the proposed method can effectively and efficiently scale up: effects on training stability, convergence speed, or performance on very large-scale or long-sequence tasks.
- The evaluations are limited to synthetic noise. Most robustness experiments use synthetic corruptions, while real-world noise and other modalities e.g., cross-attention, multimodal
- Replacing DT's static $\lambda$ with an input-dependent gate is a reasonable modification
- Evaluations are averaged over 5 seeds with sd reported, which is excellent
- Empirical evaluation is fairly sound, though shows relatively modest results (ImageNet is convincing, mod concerns about memory/compute being held equal)
- Something only mentioned in the appendix: they got rid of DT's $\lambda$ schedule, using a fixed value of 0.8. This seems like a potentially useful contribution, especially
- More discussion of compute/memory requirements would be appreciated; it's unclear if the relatively small gains on ImageNet are worth potential additional training/inference time. The appendix mentions it's roughly equal, but seems somewhat offhand.
- Relatively small gains compared to DT itself, except the Newsgroup dataset where DT performs suspiciously badly.
- Undertrained CIFAR-10 baselines -- 75% accuracy on CIFAR-10 is super low, makes it hard to trust the comparisons. You could use so
The paper introduces M-DGSA, a method that learns per-head, input-dependent gating to dynamically suppress attention noise. The results demonstrate consistent improvements in noisy environments, showcasing the method's effectiveness. Additionally, the paper is well-written and accessible.
The paper presents a relatively straightforward idea and lacks significant originality. Compared to the approach in Ye et al. (2024), there are no major innovations or departures in the proposed method. The experiments have some notable limitations. The CIFAR and MNIST datasets are relatively small and simple, which may not fully showcase the model's capabilities. Additionally, the reported ImageNet accuracy (Table 2) is low. Due to these factors, the claims regarding the algorithm's effectiven
Taxonomy
Topics: Neural Networks and Applications
Methods: Linear Layer · Adam · Dense Connections · Vision Transformer · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Byte Pair Encoding
