RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

Zihong Zhang; Zuchao Li; Lefei Zhang; Ping Wang; Hai Zhao

arXiv:2604.14885·cs.CL·April 17, 2026

TL;DR

RACER is a lightweight, training-free decoding method for LLMs that combines retrieval-based exact patterns with logit-driven cues to significantly accelerate inference while maintaining accuracy.

Contribution

The paper introduces RACER, a retrieval-augmented speculative decoding technique that unifies retrieved exact patterns with logit-driven cues to improve drafting speed and reliability without additional training.

Findings

01. Achieves more than 2× speedup in inference over autoregressive decoding.

02. Outperforms prior training-free speculative decoding methods.

03. Demonstrates effectiveness on Spec-Bench, HumanEval, and MGSM-ZH.

Abstract

Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding), a lightweight and training-free method that integrates retrieved exact patterns with logit-driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec-Bench, HumanEval, and MGSM-ZH demonstrate that RACER consistently accelerates inference, achieving more than 2× speedup over autoregressive…
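
The guess-and-verify strategy mentioned in the abstract can be illustrated with a small sketch. This is not the RACER algorithm. It shows only a generic retrieval-based draft (copy the tokens that followed the most recent earlier occurrence of the current n-gram) and greedy verification against a toy target model. All names here (`retrieve_draft`, `verify_greedy`, `toy_next_token`) are hypothetical, and the token-by-token verification loop is a simplification: practical systems typically score the whole draft in a single batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_draft(context: Sequence[int], ngram: int = 2, max_len: int = 4) -> List[int]:
    """Hypothetical retrieval draft: find the most recent earlier occurrence
    of the last `ngram` tokens and return up to `max_len` tokens after it."""
    if len(context) <= ngram:
        return []
    suffix = list(context[-ngram:])
    # Scan right to left over positions strictly before the suffix itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == suffix:
            follow = start + ngram
            return list(context[follow:follow + max_len])
    return []  # no exact match: the retrieval draft is empty


def verify_greedy(
    context: Sequence[int],
    draft: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
) -> Tuple[List[int], int]:
    """Accept the longest draft prefix that matches the target model's greedy
    choice, then append one token from the target model itself."""
    out = list(context)
    accepted = 0
    for tok in draft:
        if next_token(out) != tok:
            break  # first mismatch ends acceptance
        out.append(tok)
        accepted += 1
    out.append(next_token(out))  # correction (or bonus) token from the target
    return out, accepted


def toy_next_token(tokens: Sequence[int]) -> int:
    """Toy stand-in for a target model (an assumption): counts modulo 5."""
    return (tokens[-1] + 1) % 5


context = [0, 1, 2, 3, 4, 0, 1]
draft = retrieve_draft(context)
output, accepted = verify_greedy(context, draft, toy_next_token)
print(draft, accepted, output)
```

In this toy run the whole draft is accepted, so one verification step extends the sequence by several tokens instead of one. When `retrieve_draft` finds no match it returns an empty list and the step falls back to a single autoregressive token, which is the failure mode the abstract attributes to retrieval-only drafting.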

Code & Models

Repositories

hkr04/RACER (GitHub)
