Understanding Differential Transformer Unchains Pretrained Self-Attentions
Chaerin Kong, Jiho Jang, Nojun Kwak

TL;DR
This paper investigates why Differential Transformer attention improves model performance, identifying factors such as greater expressivity and reduced redundancy among attention heads. It also introduces DEX, a lightweight method for bringing these benefits to pretrained language models.
Contribution
The paper identifies the core reasons behind Differential Transformer's success and proposes DEX, a lightweight approach for integrating differential attention into pretrained models.
Findings
Differential attention enhances expressivity through negative attention and reduces redundancy among attention heads.
DEX effectively incorporates differential attention into pretrained models.
DEX achieves significant performance improvements with minimal data.
Abstract
Differential Transformer has recently gained significant attention for its impressive empirical performance, often attributed to its ability to perform noise-canceled attention. However, precisely how differential attention achieves its empirical benefits remains poorly understood. Moreover, the Differential Transformer architecture demands large-scale training from scratch, hindering the use of open pretrained weights. In this work, we conduct an in-depth investigation of Differential Transformer, uncovering three key factors behind its success: (1) enhanced expressivity via negative attention, (2) reduced redundancy among attention heads, and (3) improved learning dynamics. Based on these findings, we propose DEX, a novel method to efficiently integrate the advantages of differential attention into pretrained language models. By reusing the softmax attention scores and adding a…
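To make "negative attention" concrete, here is a minimal sketch of the differential attention operation as it is commonly described for Differential Transformer: the difference of two softmax attention maps, with the second scaled by a factor lambda. It is an illustration under stated assumptions, not the authors' implementation and not DEX. The function names, the fixed scalar `lam` (learnable in the original architecture), and the toy inputs are all hypothetical, and per-head normalization is omitted.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(Q, K):
    # Scaled dot-product attention weights: one softmax row per query.
    d = len(Q[0])
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K])
        for q in Q
    ]

def differential_attention(Q1, K1, Q2, K2, V, lam):
    # Assumption: lam is a fixed scalar here; it is learnable in the original design.
    A1 = attention_scores(Q1, K1)
    A2 = attention_scores(Q2, K2)
    # Subtracting the second map lets individual weights go below zero.
    diff = [[a1 - lam * a2 for a1, a2 in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    # Weighted sum of value vectors using the (possibly negative) weights.
    out = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in diff
    ]
    return diff, out

# Toy inputs (hypothetical): one query position, two key/value positions.
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
weights, output = differential_attention(
    Q1=[[1.0, 0.0]], K1=K, Q2=[[0.0, 1.0]], K2=K, V=V, lam=0.8
)
print(weights, output)
```

With these toy inputs, the second attention map favors the second key, so subtracting it pushes that key's weight below zero. A single softmax map cannot do this, since its weights are non-negative and sum to one. Here each row of the difference sums to `1 - lam` instead.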
Taxonomy
Topics: Neural Networks and Applications · Power Quality and Harmonics · Analog and Mixed-Signal Circuit Design
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Residual Connection · Byte Pair Encoding
