TL;DR
LogitSpec enhances retrieval-based speculative decoding for large language models by using logit-based speculation of subsequent tokens, leading to significant inference speedups and improved token acceptance rates.
Contribution
It introduces a training-free, plug-and-play method that expands the retrieval range by using the last logit to speculate the next-next token, improving decoding efficiency.
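The idea of widening retrieval with a speculated next-next token can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that next-next candidates are taken as the runner-up entries of the last logit vector and that retrieval is a simple n-gram match over the existing context. The function names (`top_k_ids`, `retrieve_continuation`, `draft_with_logit_guidance`) and the parameters `k` and `max_draft` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def top_k_ids(logits: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest logits, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def retrieve_continuation(
    context: Sequence[int], query: Sequence[int], max_len: int
) -> Optional[List[int]]:
    """Find the most recent occurrence of `query` in `context` and return
    up to `max_len` tokens that follow it, or None if there is no match."""
    n = len(query)
    for start in range(len(context) - n, -1, -1):
        if list(context[start:start + n]) == list(query):
            return list(context[start + n:start + n + max_len])
    return None


def draft_with_logit_guidance(
    context: Sequence[int],
    last_logits: Sequence[float],
    k: int = 2,
    max_draft: int = 3,
) -> Tuple[int, List[List[int]]]:
    """Return the next token and a list of draft continuations.

    Assumption: the top-1 entry of the last logit is the next token, and the
    following k entries serve as guesses for the next-next token.
    """
    ranked = top_k_ids(last_logits, k + 1)
    next_token, candidates = ranked[0], ranked[1:]

    drafts: List[List[int]] = []
    for cand in candidates:
        # Query with the bigram (next token, speculated next-next token).
        cont = retrieve_continuation(context, [next_token, cand], max_draft - 1)
        if cont is not None:
            drafts.append([cand] + cont)

    if not drafts:
        # Fall back to plain retrieval keyed on the next token alone.
        cont = retrieve_continuation(context, [next_token], max_draft)
        if cont:
            drafts.append(cont)
    return next_token, drafts


# Toy example over a 10-token vocabulary.
context = [5, 7, 9, 2, 5, 7, 9, 4]
logits = [0.0] * 10
logits[7], logits[9], logits[3] = 5.0, 4.0, 3.0
next_token, drafts = draft_with_logit_guidance(context, logits)
print(next_token, drafts)
```

In this toy run the speculated bigram (7, 9) matches the context while (7, 3) does not, so one draft is produced. The target model would still have to verify every draft before any token is accepted.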
Findings
Achieves up to 2.61× speedup in inference
Increases mean accepted tokens per decoding step to 3.28
Demonstrates effectiveness across various text generation benchmarks
Abstract
Speculative decoding (SD), where a small draft model proposes draft tokens in advance and the target model then validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many efforts to improve SD eliminate the need for a draft model and generate draft tokens in a retrieval-based manner, further reducing the drafting overhead and significantly easing deployment. However, retrieval-based SD relies on a matching paradigm to retrieve the most relevant reference as the draft tokens, and these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. Our LogitSpec is motivated by the observation that the logit of the last token can not only…
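The draft-then-verify loop described in the abstract can be sketched for the greedy case as follows. This is a generic illustration of speculative decoding, not code from the paper. The callable `target_next` is a hypothetical stand-in for the target model's greedy next-token prediction, and the sequential loop stands in for what a real system would compute in a single parallel forward pass.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Accept the longest prefix of `draft` that matches the target model's
    greedy choices, then append one token produced by the target itself."""
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(list(prefix) + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(list(prefix) + accepted))
    return accepted


def toy_target(tokens: List[int]) -> int:
    """Hypothetical target model: always predicts (last token + 1) mod 10."""
    return (tokens[-1] + 1) % 10


partial = verify_greedy([1, 2], [3, 4, 9], toy_target)
full = verify_greedy([1, 2], [3, 4], toy_target)
print(partial, full)
```

Each call corresponds to one decoding step, so the length of the returned list is the number of tokens produced in that step. Averaging that length over steps gives a mean-accepted-tokens figure under this counting convention, which may differ from the one used in the paper.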
