Shared DIFF Transformer

Yueyang Cang; Yuhang Liu; Xiaoteng Zhang; Li Shi; Wenge Que

arXiv:2501.17900 · cs.LG · December 17, 2025

TL;DR

Shared DIFF Transformer enhances the original differential attention mechanism by reducing parameter redundancy through shared base matrices and low-rank updates, leading to improved efficiency and performance in long-sequence tasks.

Contribution

The paper adds a shared base matrix and low-rank updates to the differential attention mechanism. The shared matrix removes the parameter redundancy of two fully independent projections, while the low-rank updates retain task-specific flexibility.
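To make the parameter saving concrete, here is a minimal sketch that counts query and key projection weights for a single head under two schemes. The factorization (one shared base matrix plus a rank-r update per branch), the function names, and the sizes are assumptions chosen for illustration, not figures reported in the paper.

```python
def independent_params(d_model: int, d_head: int) -> int:
    """Weights for two independent query and two independent key projections."""
    return 4 * d_model * d_head


def shared_low_rank_params(d_model: int, d_head: int, rank: int) -> int:
    """Weights for a shared base plus low-rank updates (assumed layout).

    One shared base matrix for queries and one for keys, plus four
    low-rank updates (two branches x query/key), each factored as
    (d_model x rank) @ (rank x d_head).
    """
    base = 2 * d_model * d_head
    low_rank = 4 * rank * (d_model + d_head)
    return base + low_rank


# Hypothetical sizes, chosen only to show the trend.
d_model, d_head, rank = 512, 64, 8
print(independent_params(d_model, d_head))            # 131072
print(shared_low_rank_params(d_model, d_head, rank))  # 83968
```

Under these assumed sizes the shared scheme uses roughly 64% of the weights of the independent one, and the gap widens as the rank shrinks relative to the head dimension.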

Findings

1. Outperforms DIFF Transformer in long-sequence modeling
2. Achieves better key information retrieval accuracy
3. Enhances in-context learning capabilities

Abstract

DIFF Transformer improves attention allocation by enhancing focus on relevant context while suppressing noise. It introduces a differential attention mechanism that calculates the difference between two independently generated attention distributions, effectively reducing noise and promoting sparse attention patterns. However, the independent signal generation in DIFF Transformer results in parameter redundancy and suboptimal utilization of information. In this work, we propose Shared DIFF Transformer, which draws on the idea of a differential amplifier by introducing a shared base matrix to model global patterns and incorporating low-rank updates to enhance task-specific flexibility. This design significantly reduces parameter redundancy, improves efficiency, and retains strong noise suppression capabilities. Experimental results show that, compared to DIFF Transformer, our method…
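The mechanism described in the abstract can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: each branch's projection is assumed to be a shared base matrix plus a low-rank product, the second attention map is scaled by a scalar `lam` before subtraction (as in the original DIFF Transformer), and value projection, causal masking, and multi-head handling are omitted. All names and sizes are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax_rows(scores):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def attention_map(x, w_q, w_k):
    """softmax(Q K^T / sqrt(d)) for one branch."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    d = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    return softmax_rows(scores)


def shared_diff_attention(x, base_q, base_k, lr_q, lr_k, lam):
    """Difference of two attention maps built from shared bases.

    Assumed form: W_i = W_shared + A_i @ B_i for each branch i in {1, 2}.
    """
    maps = []
    for (a_q, b_q), (a_k, b_k) in zip(lr_q, lr_k):
        w_q = add(base_q, matmul(a_q, b_q))
        w_k = add(base_k, matmul(a_k, b_k))
        maps.append(attention_map(x, w_q, w_k))
    first, second = maps
    return [[p - lam * q for p, q in zip(r1, r2)] for r1, r2 in zip(first, second)]


rng = random.Random(0)
n, d_model, d_head, rank = 4, 8, 4, 2  # hypothetical toy sizes
x = rand_matrix(n, d_model, rng, scale=1.0)
base_q = rand_matrix(d_model, d_head, rng)
base_k = rand_matrix(d_model, d_head, rng)
lr_q = [(rand_matrix(d_model, rank, rng), rand_matrix(rank, d_head, rng)) for _ in range(2)]
lr_k = [(rand_matrix(d_model, rank, rng), rand_matrix(rank, d_head, rng)) for _ in range(2)]

diff = shared_diff_attention(x, base_q, base_k, lr_q, lr_k, lam=0.5)
```

Because each softmax row sums to 1, every row of `diff` sums to `1 - lam`. Entries can be negative, which is how attention mass common to both branches gets cancelled.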

Taxonomy

Topics: Sensor Technology and Measurement Systems · Magnetic Bearings and Levitation Dynamics · Optical Systems and Laser Technology

Methods: Linear Layer · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Softmax · Balanced Selection · Adam · Residual Connection