TL;DR
RACER is a lightweight, training-free decoding method for LLMs that combines retrieval-based exact patterns with logit-driven cues to significantly accelerate inference while maintaining accuracy.
Contribution
The paper introduces RACER, a retrieval-augmented speculative decoding technique that unifies retrieved exact patterns with logit-driven cues to improve speed and reliability without additional training.
Findings
Achieves over 2x speedup in inference compared to autoregressive decoding.
Outperforms prior training-free speculative decoding methods.
Demonstrates effectiveness on Spec-Bench, HumanEval, and MGSM-ZH datasets.
Abstract
Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding), a lightweight and training-free method that integrates retrieved exact patterns with logit-driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec-Bench, HumanEval, and MGSM-ZH demonstrate that RACER consistently accelerates inference, achieving more than 2× speedup over autoregressive…
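To make the guess-and-verify idea concrete, the following is a minimal sketch of training-free speculative decoding with a retrieval-based draft. It is not RACER's algorithm: the function names (`retrieve_draft`, `speculative_step`, `generate`), the n-gram lookup, the greedy acceptance rule, and the `fallback_draft` hook (a stand-in for any logits-based drafter) are all illustrative assumptions, and the "model" is a plain Python callable rather than an LLM.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextToken = Callable[[Sequence[Token]], Token]          # hypothetical greedy target model
Drafter = Callable[[Sequence[Token]], List[Token]]      # hypothetical fallback drafter


def retrieve_draft(context: Sequence[Token], ngram: int = 2, max_len: int = 4) -> List[Token]:
    """Retrieval draft: find the most recent earlier occurrence of the last
    `ngram` tokens and copy what followed it. Returns [] when no exact match exists."""
    if len(context) <= ngram:
        return []
    suffix = list(context[-ngram:])
    # Scan backwards over earlier positions, skipping the suffix itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == suffix:
            return list(context[start + ngram:start + ngram + max_len])
    return []


def speculative_step(context: Sequence[Token], target_next: NextToken,
                     draft: Sequence[Token]) -> List[Token]:
    """Verify: keep the longest draft prefix the target agrees with, then add
    one token from the target. A real system checks the whole draft in a single
    batched forward pass; the sequential calls here are only for clarity."""
    ctx = list(context)
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(ctx + accepted)
        if expected != tok:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(tok)
    accepted.append(target_next(ctx + accepted))  # bonus token after a fully accepted draft
    return accepted


def generate(prompt: Sequence[Token], target_next: NextToken,
             fallback_draft: Drafter, max_new_tokens: int) -> Tuple[List[Token], int]:
    """Decode with retrieval drafts, using `fallback_draft` when retrieval finds
    nothing. Returns the tokens and the number of verification steps taken."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = retrieve_draft(out) or fallback_draft(out)
        out.extend(speculative_step(out, target_next, draft))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps


if __name__ == "__main__":
    # Toy target: a repeating 0..4 cycle, so retrieval drafts are always right.
    toy_target = lambda ctx: (ctx[-1] + 1) % 5
    tokens, steps = generate([0, 1, 2, 3, 4, 0, 1], toy_target, lambda ctx: [], 8)
    print(tokens, steps)
```

Because every accepted token is one the target itself would have produced, the output of this sketch matches plain greedy decoding; the saving comes only from needing fewer verification steps than generated tokens. When retrieval returns an empty draft and the fallback also returns nothing, each step degrades to ordinary one-token decoding.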
