READER: Retrieval-Assisted Drafter for Efficient LLM Inference
Maxim Divilkovskiy, Vitaly Malygin, Sergey Zlobin, Stanislav Ilyushin, Sultan Isali, Vasily Kalugin, Nuriza Aitassova, Fei Yi, Weidi Zeng

TL;DR
READER introduces a lossless speculative decoding framework that significantly accelerates large language model inference by exploiting natural language redundancy, achieving up to 6.13x speedup without sacrificing output accuracy.
Contribution
It presents a novel, provably lossless speculative decoding method that bypasses auxiliary model training and offers a scalable, memory-efficient approach for faster LLM inference.
Findings
Up to 6.13x speedup on single-prompt inference
Up to 5.92x speedup on batched inference
Consistently outperforms prior speculative decoding methods
Abstract
Autoregressive Language Models instantiate a factorized likelihood over token sequences, yet their strictly sequential decoding process imposes an intrinsic lower bound on inference latency. This bottleneck has emerged as a central obstacle to the scalable deployment of large-scale generative models. Existing acceleration techniques partially mitigate token-level latency by relying on auxiliary draft models or introducing an additional training phase, but fail to address the dominant memory and communication costs. We present READER, a provably lossless speculative decoding framework that bypasses the training of the auxiliary draft model. READER formalizes speculative decoding as a stochastic tree construction problem and exploits the empirical redundancy structure of natural language to generate high-probability candidate continuations. Our method revisits the problem of constructing…
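To make the idea of retrieval-based drafting with lossless verification concrete, here is a minimal sketch. It is not READER's actual algorithm: it assumes greedy decoding, a single linear draft instead of a tree, and a toy deterministic function standing in for the target model. The names `retrieve_draft`, `speculative_greedy`, `plain_greedy`, and `toy_target` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextToken = Callable[[Sequence[Token]], Token]


def retrieve_draft(context: Sequence[Token], ngram: int, max_draft: int) -> List[Token]:
    """Find the most recent earlier occurrence of the last `ngram` tokens
    and copy up to `max_draft` tokens that followed it."""
    if len(context) <= ngram:
        return []
    suffix = list(context[-ngram:])
    # Scan right to left, starting just before the suffix itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == suffix:
            return list(context[start + ngram:start + ngram + max_draft])
    return []


def speculative_greedy(target: NextToken, prompt: Sequence[Token], num_new: int,
                       ngram: int = 2, max_draft: int = 4) -> Tuple[List[Token], int]:
    """Greedy decoding with retrieved drafts. Returns (tokens, verification passes)."""
    out = list(prompt)
    limit = len(prompt) + num_new
    passes = 0
    while len(out) < limit:
        draft = retrieve_draft(out, ngram, max_draft)
        # One verification pass. A real system would score every draft position
        # in a single batched forward pass; the toy model is called per position.
        passes += 1
        accepted: List[Token] = []
        for tok in draft:
            if target(out + accepted) != tok:
                break  # first mismatch: discard the rest of the draft
            accepted.append(tok)
        out.extend(accepted)
        out.append(target(out))  # the target's own next token is always emitted
    return out[:limit], passes


def plain_greedy(target: NextToken, prompt: Sequence[Token], num_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target(out))
    return out


def toy_target(seq: Sequence[Token]) -> Token:
    """Assumed stand-in for an LLM: deterministic and highly repetitive."""
    return (seq[-1] + 1) % 5


prompt = [0, 1, 2, 3, 4, 0, 1]
spec_tokens, spec_passes = speculative_greedy(toy_target, prompt, num_new=20)
base_tokens = plain_greedy(toy_target, prompt, num_new=20)
```

Every emitted token is one the target itself would have picked, so the output matches plain greedy decoding. Any gain comes from repetitive contexts, where several retrieved tokens can be confirmed in one verification pass.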
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper investigates combining retrieval-based and drafting-based speculative decoding. This exploration should be encouraged.
2. The authors conduct comprehensive experiments on a wide range of text generation benchmarks. The models include Llama2-7B, Vicuna-7B, and Llama3.1-8B.
3. It is worth noting that the authors consider KV cache management in the speculative decoding paradigm and conduct experiments in batched settings, which are valuable in practice.
1. **The writing is poor and the demonstrations are confusing**. The manuscript is poorly written, and readers may find it difficult to grasp the main contributions of the proposed methodology. Detailed errors and problems are noted in questions 1-5 below. I strongly recommend that the authors polish this manuscript further for the next submission.
2. **Lack of detail in the methodology**. I understand that READER aims to accelerate speculative decoding by constructing drafts augmented wit
1. The hybrid approach of combining a learned draft model with a training-free, retrieval-based drafter is a novel and highly effective idea. This allows the system to leverage the best of both worlds: the draft model's ability to generate novel text and the retrieval system's efficiency in handling common, repetitive sequences.
2. The paper is well-grounded in theory, formalizing the speculative decoding process as a throughput optimization problem over a heterogeneous tree. This provides a pr
1. The primary concern is that the method's impressive performance may be heavily skewed towards tasks with high textual repetition, which is the ideal scenario for a retrieval-based drafter. The remarkable >10x speedup on RAG, where the model copies extensively from the context, and the strong performance on code generation are clear evidence of this. However, the benefits might be substantially lower for more open-ended, creative, or complex reasoning tasks that require generating novel text w
1. The combination of retrieval-based speculative decoding with model-based speculative decoding is quite interesting and novel, in my opinion.
2. The speedup number is quite impressive.
1. The claim "bypasses the training of the auxiliary draft model" is a bit misleading, as the method still needs a trained draft model.
2. I feel Section 3 is also part of the methodology, and maybe Section 4 should be renamed to improve clarity.
3. Section 4 is a bit confusing on its own. Most of the algorithm details are in the appendix. It would be helpful to include a concise pseudo-algorithm for the overall method in Section 4.
4. The evaluation is a bit too short (only one and a
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Big Data and Digital Economy
