TL;DR
Deep Delta Learning introduces a residual update mechanism for Transformers that allows selective rewriting of residual content, improving language modeling quality over traditional additive residuals.
Contribution
The paper proposes a delta-rule residual update that lets each Transformer layer selectively overwrite residual content while preserving the identity path.
Findings
DDL improves language modeling performance compared to pure additive residuals.
Residual rewrite operations enhance model quality in downstream tasks.
DDL maintains the identity path while allowing content replacement.
Abstract
Transformer residual streams evolve by additive accumulation: each layer appends a feature update to a shared hidden state, but has no direct mechanism for replacing content that has become obsolete or conflicting. We introduce Deep Delta Learning (DDL), a residual update rule that preserves the identity path while giving every layer the ability to selectively rewrite residual content. DDL reads the current state along a learned direction, compares it with a learned target value, and writes back a gated correction along the same direction. When the gate is closed, the update reduces to the identity; when the gate is fully open, the selected component is overwritten, yielding a depth-wise delta-rule generalization of standard residual addition. We integrate DDL in decoder-only language models with both scalar and expanded residual states, while keeping attention and MLP sublayers at the…
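The read, compare, and write-back steps described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `delta_residual_update` and the symbols `k` (direction), `v` (target value), and `beta` (gate) are hypothetical, the direction is normalized to unit length so that a fully open gate overwrites the selected component exactly, and only a single vector-valued state is covered. In a real model these quantities would presumably be produced by learned functions of the layer input rather than passed in as constants.

```python
import numpy as np

def delta_residual_update(x, k, v, beta):
    """One gated delta-rule update of a residual state (illustrative sketch).

    x    : (d,) current residual state
    k    : (d,) direction to read and write along (assumed; normalized here)
    v    : scalar target value for the component along k (assumed)
    beta : gate in [0, 1]; 0 keeps x unchanged, 1 overwrites the component
    """
    k = k / np.linalg.norm(k)          # unit direction
    read = k @ x                       # read the state along k
    correction = v - read              # compare with the target value
    return x + beta * correction * k   # write back along the same direction

rng = np.random.default_rng(0)
x = rng.normal(size=8)
k = rng.normal(size=8)
v = 1.5
k_unit = k / np.linalg.norm(k)

closed = delta_residual_update(x, k, v, beta=0.0)  # gate closed: identity
opened = delta_residual_update(x, k, v, beta=1.0)  # gate open: overwrite

print(np.allclose(closed, x))           # state is unchanged
print(np.isclose(k_unit @ opened, v))   # component along k now equals v
```

With the gate at an intermediate value, the component along `k` moves a fraction `beta` of the way toward `v`, while every component orthogonal to `k` is left untouched. Standard residual addition has no such overwrite: it can only add to the component, never replace it.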
