Test-Time Speculation
Avinash Kumar, Sujay Sanghavi, Poulami Das

TL;DR
Test-Time Speculation (TTS) is an online distillation method that continuously adapts the draft model during inference, raising speculative-decoding acceptance lengths on long-response tasks by up to 72% over state-of-the-art speculators.
Contribution
The paper introduces TTS, a novel online adaptation technique that improves speculative decoding by continuously updating the draft model during inference.
Findings
TTS increases acceptance lengths by up to 72% over state-of-the-art speculators.
Acceptance lengths decline with generation length in existing methods, limiting long-response performance.
TTS maintains higher acceptance lengths across multiple models and longer outputs.
Abstract
Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the acceptance length, or number of draft tokens accepted by the target. Our studies show that the acceptance lengths of even state-of-the-art speculators, like DFlash, EAGLE-3 and PARD, degrade with generation length, reaching values close to 1 (i.e. no speedup) within just a few thousand output tokens, making speculators ineffective for long-response tasks. Acceptance lengths decline because most speculators are trained offline on short sequences, but are forced to match the target model on much longer outputs at inference, well beyond their training distribution. To address this issue, we propose Test-Time Speculation (TTS), an online distillation approach that continuously adapts the speculator…
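To make the acceptance-length metric concrete, here is a minimal sketch of how it could be measured under greedy verification. It is an illustration, not the paper's implementation: the function names are hypothetical, and it assumes the common convention that each verification step emits the accepted draft tokens plus one token from the target model, which is why a value close to 1 corresponds to no speedup.

```python
from typing import Sequence, Tuple


def accepted_draft_tokens(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count the longest prefix of draft tokens that matches the target's
    greedy choices. Verification stops at the first mismatch.
    (Hypothetical helper; greedy matching is an assumption.)"""
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted


def tokens_per_step(draft: Sequence[int], target: Sequence[int]) -> int:
    """Tokens emitted by one verification step: the accepted draft tokens
    plus one token supplied by the target model (assumed convention)."""
    return accepted_draft_tokens(draft, target) + 1


def mean_acceptance_length(
    steps: Sequence[Tuple[Sequence[int], Sequence[int]]]
) -> float:
    """Average tokens emitted per verification step over a generation."""
    if not steps:
        raise ValueError("need at least one verification step")
    return sum(tokens_per_step(d, t) for d, t in steps) / len(steps)
```

Tracking this average over successive windows of a long generation would show the kind of decline the abstract describes: as the draft model drifts from the target, more steps stop at the first draft token and the average falls toward 1.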
