TL;DR
This paper identifies attention drift in speculative-decoding drafter models, traces it to the un-normalized residual path between chain steps, and proposes architectural modifications that make drafters more robust and double acceptance length.
Contribution
The paper reports attention drift in speculative-decoding drafters, traces its cause to the un-normalized residual path between chain steps, and introduces architectural changes that improve drafter robustness.
Findings
Attention drift: as the speculation chain deepens, the drafter's attention shifts from the prompt to its own recently generated tokens.
Architectural modifications like post-norm and RMSNorm mitigate attention drift.
Proposed changes double acceptance length and improve performance on benchmarks.
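One way to make the first finding concrete is to track, for each step of a speculation chain, how much of the drafter's attention weight still lands on prompt positions. The sketch below is illustrative only: the function name `prompt_attention_mass` and the toy attention rows are assumptions made for this example, not the paper's metric or its measurements.

```python
import numpy as np

def prompt_attention_mass(attn_row, prompt_len):
    """Fraction of one query's attention weight that lands on prompt positions.

    attn_row:   1-D sequence of non-negative attention weights over all
                visible positions (prompt positions first, drafted tokens after).
    prompt_len: number of prompt positions at the start of attn_row.
    """
    attn_row = np.asarray(attn_row, dtype=float)
    total = attn_row.sum()
    if total <= 0:
        raise ValueError("attention row must have positive total weight")
    return float(attn_row[:prompt_len].sum() / total)

# Hand-written toy rows for three chain steps (hypothetical, NOT measured data).
# Each step sees one more drafted token than the previous one.
prompt_len = 4
rows = [
    [0.20, 0.20, 0.20, 0.20, 0.20],
    [0.15, 0.15, 0.15, 0.15, 0.20, 0.20],
    [0.10, 0.10, 0.10, 0.10, 0.20, 0.20, 0.20],
]
masses = [prompt_attention_mass(r, prompt_len) for r in rows]
print([round(m, 2) for m in masses])
```

In this toy input the prompt share falls at every step, which is the pattern the summary describes as drift. On a real drafter the rows would come from its attention weights, typically averaged over heads and layers.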
Abstract
Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously unreported phenomenon we call \textbf{attention drift}: as the drafter generates successive tokens within a speculation chain, attention progressively moves from the prompt onto its own recently generated tokens. We observe this across both \emph{EAGLE3} drafters and \emph{MTP heads}, suggesting drift is a property of drafter designs. We trace this to the un-normalized residual path between chain steps: the drafter's hidden state magnitude grows monotonically with chain depth, so the drafter exhibits dynamics consistent with additional pre-norm transformer layers stacked on the target rather than with a standalone autoregressive predictor. To limit this growth, we propose two…
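The magnitude-growth mechanism in the abstract can be illustrated with a toy residual chain. The sketch below is not the paper's architecture, and the abstract's two proposals are elided here, so it makes no claim to reproduce them. It assumes a hypothetical block with non-negative weights, chosen so that each un-normalized residual step can only increase the state norm (a property of this toy, not of real drafters), and compares that with applying RMSNorm to the state handed to the next chain step.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Rescale x so that its root-mean-square is approximately 1."""
    return x / np.sqrt(np.mean(x * x) + eps)

def chain_norms(depth, dim=64, normalize=False, seed=0):
    """L2 norm of the hidden state after each step of a toy residual chain.

    normalize=False: the state is passed to the next step as-is.
    normalize=True:  RMSNorm is applied to the state between steps.
    """
    rng = np.random.default_rng(seed)
    # Toy block with non-negative weights and a non-negative initial state,
    # so ReLU(W @ h) is non-negative and the residual sum cannot shrink.
    W = np.abs(rng.standard_normal((dim, dim))) / dim
    h = np.abs(rng.standard_normal(dim))
    norms = []
    for _ in range(depth):
        h = h + np.maximum(W @ h, 0.0)   # residual update: h + ReLU(W h)
        if normalize:
            h = rms_norm(h)              # normalize before the next chain step
        norms.append(float(np.linalg.norm(h)))
    return norms

raw = chain_norms(8, normalize=False)
normed = chain_norms(8, normalize=True)
print("un-normalized:", [round(n, 1) for n in raw])
print("with RMSNorm: ", [round(n, 1) for n in normed])
```

With normalization the norm stays pinned near `sqrt(dim)` at every depth, while the un-normalized state keeps growing. Attention logits scale with the magnitude of the states they are computed from, which suggests why unchecked growth could skew attention toward later chain positions, though this toy does not model attention itself.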
