RADAR: Accelerating Large Language Model Inference With RL-Based Dynamic Draft Trees

Junjie Ma; Jinlong Li

arXiv:2512.14069·cs.AI·December 17, 2025

TL;DR

RADAR is a speculative sampling method that uses reinforcement learning to build dynamic draft trees. By deciding at inference time how many times to call the draft model, it cuts redundant computation and reaches a 3.17x-4.82x speedup over auto-regressive decoding.

Contribution

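The paper formulates draft tree generation in speculative sampling as a Markov Decision Process and trains a prediction model with offline reinforcement learning, so the number of draft-model calls is decided in real time rather than fixed as a preset hyperparameter.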

Findings

01

Achieves a 3.17x-4.82x speedup over the auto-regressive decoding baseline

02

Reduces redundant calls to the draft model during inference

03

Evaluated on three LLMs and four tasks

Abstract

Inference with modern Large Language Models (LLMs) is expensive and slow, and speculative sampling has emerged as an effective solution to this problem. However, the number of calls to the draft model for generating candidate tokens in speculative sampling is a preset hyperparameter, which lacks flexibility. To generate and utilize the candidate tokens more effectively, we propose RADAR, a novel speculative sampling method with RL-based dynamic draft trees. RADAR formulates the draft tree generation process as a Markov Decision Process (MDP) and employs offline reinforcement learning to train a prediction model, which enables real-time decisions on calls to the draft model, reducing redundant computations and further accelerating inference. Evaluations across three LLMs and four tasks show that RADAR achieves a speedup of 3.17x-4.82x over the auto-regressive decoding baseline. The…
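
The core idea in the abstract, deciding at each step whether another draft-model call is worthwhile, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the MDP state, the reward, or the prediction model, so `DraftState`, `threshold_policy`, `toy_draft_step`, and the 0.3 cutoff are all hypothetical stand-ins chosen only to show the control flow of a dynamic stopping decision versus a fixed call budget.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DraftState:
    """Hypothetical MDP state: what the policy sees before each draft call."""
    depth: int             # draft-model calls made so far
    top_confidence: float  # best draft probability in the latest tree layer


def threshold_policy(state: DraftState, min_conf: float = 0.3) -> bool:
    """Stand-in for a learned prediction model (assumption, not from the paper).

    Returns True to call the draft model again, False to stop drafting.
    """
    return state.top_confidence >= min_conf


def toy_draft_step(depth: int) -> float:
    """Toy draft model: confidence decays geometrically with tree depth."""
    return 0.9 * (0.7 ** depth)


def grow_draft_tree(
    draft_step: Callable[[int], float],
    policy: Callable[[DraftState], bool],
    max_depth: int = 8,
) -> List[float]:
    """Expand the draft tree one layer per call until the policy says stop.

    Returns the confidence recorded for each layer that was drafted, so the
    list length equals the number of draft-model calls actually made.
    """
    layers: List[float] = []
    state = DraftState(depth=0, top_confidence=1.0)
    while state.depth < max_depth and policy(state):
        conf = draft_step(state.depth)
        layers.append(conf)
        state = DraftState(depth=state.depth + 1, top_confidence=conf)
    return layers


if __name__ == "__main__":
    dynamic = grow_draft_tree(toy_draft_step, threshold_policy)
    fixed = grow_draft_tree(toy_draft_step, lambda s: True)
    print(f"dynamic policy: {len(dynamic)} draft calls")
    print(f"fixed budget:   {len(fixed)} draft calls")
```

In this toy setting the always-continue policy spends the full budget of 8 draft calls, while the threshold policy stops after 5 because the simulated confidence has dropped below the cutoff. A learned policy would replace the hand-written threshold, and the drafted tree would still need to be verified by the target model, which this sketch omits.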

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques