Reinforcement Speculative Decoding for Fast Ranking
Yingpeng Du, Tianjun Wei, Zhu Sun, Jie Zhang

TL;DR
This paper introduces Reinforcement Speculative Decoding, a method that uses reinforcement learning to speed up ranking inference with large language models while keeping rankings accurate under strict latency constraints.
Contribution
It proposes an up-to-down decoding paradigm with a reinforcement learning-based agent that iteratively modifies rankings, leveraging listwise knowledge for faster inference in ranking systems.
Findings
Significant speedup in ranking inference tasks.
Maintains high ranking accuracy with reduced latency.
Effective in both information retrieval and recommender systems.
Abstract
Large Language Models (LLMs) have been widely adopted in ranking systems such as information retrieval (IR) systems and recommender systems (RSs). To alleviate the latency of auto-regressive decoding, some studies explore single (first) token decoding for ranking approximation, but they suffer from severe degradation in tail positions. Although speculative decoding (SD) methods can be a remedy with verification at different positions, they face challenges in ranking systems due to their left-to-right decoding paradigm. First, ranking systems require strict latency constraints, but verification rounds in SD methods remain agnostic. Second, SD methods usually discard listwise ranking knowledge about unaccepted items in previous rounds, hindering future multi-token prediction, especially when candidate tokens are the unaccepted items. In this paper, we propose a Reinforcement…
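The single (first) token decoding baseline mentioned in the abstract can be illustrated with a small sketch. It assumes each candidate item maps to exactly one identifier token and that the logits of the first decoding step are available as a plain dictionary. The function name, the identifiers "A" to "D", and the logit values are hypothetical and are not taken from the paper. The sketch shows only the approximation being criticized, not the proposed method.

```python
import math

def rank_by_first_token(first_token_logits, candidate_ids):
    """Rank candidates using only the first decoding step.

    first_token_logits: dict mapping an identifier token to its logit
    (hypothetical input; a real system would read these from the LLM).
    candidate_ids: identifier tokens of the items to rank.
    """
    # Softmax restricted to the candidate identifier tokens.
    logits = [first_token_logits[c] for c in candidate_ids]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = {c: e / total for c, e in zip(candidate_ids, exps)}

    # One forward pass yields the whole ranking: sort by first-token probability.
    ranking = sorted(candidate_ids, key=lambda c: probs[c], reverse=True)
    return ranking, probs

# Made-up logits for four candidate items.
example_logits = {"A": 2.0, "B": 0.5, "C": 1.2, "D": -0.3}
ranking, probs = rank_by_first_token(example_logits, ["A", "B", "C", "D"])
print(ranking)  # ['A', 'C', 'B', 'D']
```

Because every position is read off a distribution that the model produced to choose only the top item, the lower-ranked entries are the least reliable, which is consistent with the tail-position degradation the abstract describes.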
Taxonomy
Topics: Advanced Algebra and Logic · Bayesian Modeling and Causal Inference · Logic, Reasoning, and Knowledge
