Attention Drift: What Autoregressive Speculative Decoding Models Learn

Doğaç Eldenk; Payal Mohapatra; Yigitcan Comlek; Kaan Oktay; Hongyang Zhang; Stephen Xia

arXiv:2605.09992·cs.LG·May 12, 2026

TL;DR

This paper investigates attention drift in autoregressive speculative decoding models, traces it to the un-normalized residual path between chain steps, and proposes architectural modifications reported to double acceptance length and improve drafter robustness.

Contribution

The paper uncovers the phenomenon of attention drift in speculative decoding, traces its cause to residual path dynamics, and introduces architectural changes that enhance drafter robustness.

Findings

01. Within a speculation chain, the drafter's attention drifts from the prompt onto its own recently generated tokens.

02. Architectural modifications such as post-norm and RMSNorm mitigate attention drift.

03. The proposed changes double acceptance length and improve performance on benchmarks.
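One way to make the first finding concrete is to track how much attention each speculated token places on prompt positions. The sketch below is illustrative only and is not taken from the paper: the function name `prompt_attention_mass` and the attention weights in `toy_attn` are hypothetical, and a real measurement would use attention maps extracted from a drafter.

```python
import numpy as np

def prompt_attention_mass(attn_rows, prompt_len):
    """Fraction of each query's attention that lands on prompt positions.

    attn_rows: array of shape (num_steps, seq_len). Row k holds the attention
    weights of the k-th speculated token; each row sums to 1 and masked
    (future) positions are 0.
    """
    attn_rows = np.asarray(attn_rows, dtype=float)
    return attn_rows[:, :prompt_len].sum(axis=1)

# Hypothetical weights: 4 prompt tokens followed by 3 drafted tokens.
toy_attn = np.array([
    [0.20, 0.20, 0.20, 0.20, 0.20, 0.00, 0.00],  # chain step 1
    [0.10, 0.10, 0.10, 0.10, 0.30, 0.30, 0.00],  # chain step 2
    [0.05, 0.05, 0.05, 0.05, 0.20, 0.30, 0.30],  # chain step 3
])

mass = prompt_attention_mass(toy_attn, prompt_len=4)
print(mass)  # prompt mass per chain step; a falling trend would indicate drift
```

In this made-up example the prompt mass falls at every step, which is the pattern the finding describes.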

Abstract

Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously unreported phenomenon we call attention drift: as the drafter generates successive tokens within a speculation chain, attention progressively moves from the prompt onto its own recently generated tokens. We observe this across both EAGLE3 drafters and MTP heads, suggesting drift is a property of drafter designs. We trace this to the un-normalized residual path between chain steps: the drafter's hidden-state magnitude grows monotonically with chain depth, exhibiting dynamics consistent with additional pre-norm transformer layers stacked on the target rather than with a standalone autoregressive predictor. To limit this growth, we propose two…
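The growth mechanism described in the abstract can be illustrated with a toy recurrence. This is a minimal sketch under strong assumptions, not the paper's architecture: `toy_block` is a hypothetical stand-in for the drafter block (a fixed 90-degree rotation of coordinate pairs), and the normalized variant is an illustrative choice that is not claimed to match either of the paper's proposals. With an un-normalized residual path the block sees a normalized input, so each step adds an update of roughly constant size and the hidden-state norm accumulates. Normalizing after the residual addition pins the norm instead.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Scale x so that its root-mean-square is (approximately) 1.
    return x / np.sqrt(np.mean(x * x) + eps)

def toy_block(x):
    # Hypothetical stand-in for a drafter block: rotate each (even, odd)
    # coordinate pair by 90 degrees, so the output is orthogonal to the input.
    out = np.empty_like(x)
    out[0::2] = -x[1::2]
    out[1::2] = x[0::2]
    return out

def chain_norms(h0, steps, post_norm):
    """Hidden-state L2 norm after each chain step."""
    h = h0.copy()
    norms = []
    for _ in range(steps):
        if post_norm:
            # Normalize after the residual addition: the norm stays fixed.
            h = rms_norm(h + toy_block(h))
        else:
            # Un-normalized residual path: updates accumulate step by step.
            h = h + toy_block(rms_norm(h))
        norms.append(float(np.linalg.norm(h)))
    return norms

h0 = np.arange(1.0, 9.0)  # arbitrary 8-dimensional starting state
pre = chain_norms(h0, steps=6, post_norm=False)
post = chain_norms(h0, steps=6, post_norm=True)
print("un-normalized residual:", np.round(pre, 3))
print("normalized residual:   ", np.round(post, 3))
```

In this toy, the un-normalized chain's norm increases at every step while the normalized chain stays near the square root of the hidden dimension. Real drafter blocks are learned and need not produce orthogonal updates, so the sketch shows only why an un-normalized path permits growth, not how fast it grows in practice.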
