TL;DR
This paper introduces gradient-boosted attention, which applies the principle of gradient boosting within a single transformer layer: a second attention pass attends to the prediction error of the first and applies a gated correction, improving language modeling performance.
Contribution
It proposes a novel attention mechanism that applies gradient boosting principles inside a single layer and improves test perplexity on language modeling benchmarks.
Findings
Gradient-boosted attention improves test perplexity by over 5% on WikiText-103 and OpenWebText.
Two rounds of correction capture most of the performance gains.
The mechanism relies on the additive residual structure of Pre-LN transformers.
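The correction rounds mentioned in the findings can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the "error" of a pass is the squared-reconstruction residual `x - estimate`, that queries come from the input while keys and values come from that residual, and that the gate is a sigmoid of a learned per-dimension vector. The names `attention`, `boosted_attention`, `rounds`, and `gate_logits` are hypothetical, and causal masking and multiple heads are omitted for brevity.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, Wq, Wk, Wv):
    """Single-head scaled dot-product attention (no mask)."""
    q, k, v = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def boosted_attention(x, rounds, gate_logits):
    """rounds: list of (Wq, Wk, Wv), one tuple per pass.
    gate_logits: one per-dimension vector for each pass after the first."""
    estimate = attention(x, x, *rounds[0])         # first pass: ordinary attention
    for (Wq, Wk, Wv), g in zip(rounds[1:], gate_logits):
        residual = x - estimate                    # ASSUMPTION: squared-reconstruction error
        correction = attention(x, residual, Wq, Wk, Wv)  # separate projections per pass
        gate = 1.0 / (1.0 + np.exp(-g))            # per-dimension gate, acts like shrinkage
        estimate = estimate + gate * correction    # additive update, as in boosting
    return estimate

rng = np.random.default_rng(0)
T, d = 5, 8                                        # sequence length, model width
x = rng.normal(size=(T, d))
rounds = [tuple(0.1 * rng.normal(size=(d, d)) for _ in range(3)) for _ in range(2)]
out = boosted_attention(x, rounds, [np.zeros(d)])  # two passes, gate = 0.5 everywhere
```

With strongly negative gate logits the correction is switched off and the sketch reduces to ordinary single-pass attention, which is one way to sanity-check the additive structure.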
Abstract
Transformer attention computes a single softmax-weighted average over values, a one-pass estimate that cannot correct its own errors. We introduce gradient-boosted attention, which applies the principle of gradient boosting within a single attention layer: a second attention pass, with its own learned projections, attends to the prediction error of the first and applies a gated correction. Under a squared reconstruction objective, the construction maps onto Friedman's gradient boosting machine, with each attention pass as a base learner and the per-dimension gate as the shrinkage parameter. We show that a single Hopfield-style update erases all query information orthogonal to the stored-pattern subspace, and that further iteration under local contraction can collapse distinct queries in the same region to the same fixed point. We also show that separate projections for…
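The claim about a single Hopfield-style update can be checked numerically. The sketch below assumes the softmax form of the update used in modern Hopfield networks, `patterns.T @ softmax(beta * patterns @ query)`; the abstract does not spell out the exact form, so this is an illustration rather than the paper's construction. The names `hopfield_update`, `patterns`, and `orth` are hypothetical. Because the update sees the query only through its inner products with the stored patterns, any component orthogonal to their span has no effect on the output.

```python
import numpy as np

def hopfield_update(query, patterns, beta=1.0):
    """One softmax retrieval step; patterns has shape (N, d), one stored pattern per row."""
    scores = beta * (patterns @ query)      # the query enters only via these inner products
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return patterns.T @ w                   # output lies in the span of the stored patterns

rng = np.random.default_rng(0)
patterns = rng.normal(size=(3, 6))          # 3 stored patterns in 6 dimensions
query = rng.normal(size=6)

# Build a perturbation orthogonal to the stored-pattern subspace.
basis, _ = np.linalg.qr(patterns.T)         # orthonormal basis of the span, shape (6, 3)
noise = rng.normal(size=6)
orth = noise - basis @ (basis.T @ noise)    # remove the in-span component

same_output = np.allclose(
    hopfield_update(query, patterns),
    hopfield_update(query + orth, patterns),
)
```

Here `same_output` compares the update for `query` and for `query + orth`; the two queries differ only outside the pattern subspace, so the update cannot tell them apart.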
