Shared DIFF Transformer
Yueyang Cang, Yuhang Liu, Xiaoteng Zhang, Li Shi, Wenge Que

TL;DR
Shared DIFF Transformer enhances the original differential attention mechanism by reducing parameter redundancy through shared base matrices and low-rank updates, leading to improved efficiency and performance in long-sequence tasks.
Contribution
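Introduces a shared base matrix that models global patterns, plus low-rank updates that supply task-specific flexibility, into the differential attention mechanism. This reduces the parameter redundancy caused by DIFF Transformer's two independently generated attention signals.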
Findings
Outperforms DIFF Transformer on long-sequence modeling
Retrieves key information more accurately
Improves in-context learning
Abstract
DIFF Transformer improves attention allocation by enhancing focus on relevant context while suppressing noise. It introduces a differential attention mechanism that calculates the difference between two independently generated attention distributions, effectively reducing noise and promoting sparse attention patterns. However, the independent signal generation in DIFF Transformer results in parameter redundancy and suboptimal utilization of information. In this work, we propose Shared DIFF Transformer, which draws on the idea of a differential amplifier by introducing a shared base matrix to model global patterns and incorporating low-rank updates to enhance task-specific flexibility. This design significantly reduces parameter redundancy, improves efficiency, and retains strong noise suppression capabilities. Experimental results show that, compared to DIFF Transformer, our method…
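The mechanism described in the abstract can be sketched in a few lines of NumPy. This is a minimal, single-head illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact parameterization, so the form `W_q + A @ B` for each branch, the function names, the argument layout, and the fixed scalar `lam` are all hypothetical choices made for illustration. Multi-head structure, causal masking, and normalization are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_diff_attention(X, W_q, W_k, W_v, lora_q, lora_k, lam=0.5):
    """Differential attention with shared base projections (illustrative).

    X:      (n, d_model) input sequence
    W_q/W_k: (d_model, d) base matrices shared by both branches (assumed form)
    W_v:    (d_model, d_v) value projection
    lora_q, lora_k: two (A, B) pairs each, A: (d_model, r), B: (r, d);
            each branch uses base + A @ B (assumed low-rank update)
    lam:    weight on the subtracted map, a fixed scalar here for simplicity
    """
    d = W_q.shape[1]
    maps = []
    for (Aq, Bq), (Ak, Bk) in zip(lora_q, lora_k):
        Q = X @ (W_q + Aq @ Bq)  # branch query: shared base + low-rank update
        K = X @ (W_k + Ak @ Bk)  # branch key: shared base + low-rank update
        maps.append(softmax(Q @ K.T / np.sqrt(d)))
    # Difference of the two attention maps, applied to the values.
    return (maps[0] - lam * maps[1]) @ (X @ W_v)

def projection_param_counts(d_model, d, r):
    # Query/key projection parameters only, under the assumptions above.
    independent = 4 * d_model * d                      # two full Q and two full K matrices
    shared = 2 * d_model * d + 4 * r * (d_model + d)   # one base Q, one base K, four (A, B) pairs
    return independent, shared
```

Under these assumptions, `projection_param_counts(512, 64, 8)` gives 131072 query/key parameters for fully independent branches versus 83968 for the shared-base variant. The figures only illustrate where the savings would come from and are not results reported in the paper.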
Taxonomy
Topics: Sensor Technology and Measurement Systems · Magnetic Bearings and Levitation Dynamics · Optical Systems and Laser Technology
Methods: Linear Layer · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Softmax · Balanced Selection · Adam · Residual Connection
