Understanding Differential Transformer Unchains Pretrained Self-Attentions

Chaerin Kong; Jiho Jang; Nojun Kwak

arXiv:2505.16333·cs.LG·October 22, 2025

TL;DR

This paper investigates why Differential Transformer attention improves model performance, identifying factors such as enhanced expressivity and reduced redundancy among attention heads. It then introduces DEX, a lightweight method for bringing these benefits to pretrained language models.

Contribution

The paper uncovers the core reasons behind Differential Transformer's success and proposes DEX, a novel, lightweight approach to integrate differential attention into pretrained models.

Findings

01. Differential attention enhances expressivity and reduces redundancy.

02. DEX effectively incorporates differential attention into pretrained models.

03. DEX achieves significant performance improvements with minimal data.

Abstract

Differential Transformer has recently gained significant attention for its impressive empirical performance, often attributed to its ability to perform noise-canceled attention. However, precisely how differential attention achieves its empirical benefits remains poorly understood. Moreover, the Differential Transformer architecture demands large-scale training from scratch, hindering the use of open pretrained weights. In this work, we conduct an in-depth investigation of Differential Transformer, uncovering three key factors behind its success: (1) enhanced expressivity via negative attention, (2) reduced redundancy among attention heads, and (3) improved learning dynamics. Based on these findings, we propose DEX, a novel method to efficiently integrate the advantages of differential attention into pretrained language models. By reusing the softmax attention scores and adding a…
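
As a rough illustration of the "negative attention" mentioned in the abstract, the sketch below follows the general form of differential attention from the original Differential Transformer work: the difference of two softmax attention maps, with the second map scaled by a scalar λ. It is a toy, single-head, pure-Python version, not the authors' code and not an implementation of DEX. The function names, the fixed `lam` value, and the toy inputs are assumptions made for illustration, and the per-head normalization and learnable λ parameterization of the full architecture are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights, one softmax row per query."""
    d = len(keys[0])
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]


def differential_attention(q1, k1, q2, k2, values, lam):
    """Return (weights, output) where weights = softmax map 1 - lam * softmax map 2.

    `lam` is a fixed scalar here (an assumption); the full architecture learns it.
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    weights = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    dim = len(values[0])
    output = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)]
        for row in weights
    ]
    return weights, output


# Toy inputs (hypothetical): one query position, two key/value positions.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
lam = 0.8

# The two query projections prefer different keys, so the subtraction
# pushes the weight on the second key below zero.
weights, output = differential_attention(
    [[1.0, 0.0]], keys, [[0.0, 1.0]], keys, values, lam
)
```

Because each softmax row sums to 1, every row of `weights` sums to `1 - lam`, and individual entries can be negative, which a single softmax map cannot produce.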

Videos

Understanding Differential Transformer Unchains Pretrained Self-Attentions (SlidesLive)

Taxonomy

Topics: Neural Networks and Applications · Power Quality and Harmonics · Analog and Mixed-Signal Circuit Design

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Residual Connection · Byte Pair Encoding