Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding
Yunjia Xi, Hangyu Wang, Bo Chen, Jianghao Lin, Menghui Zhu, Weiwen Liu, Ruiming Tang, Zhewei Wei, Weinan Zhang, Yong Yu

TL;DR
This paper introduces LASER, a speculative decoding method that significantly accelerates LLM-based recommender systems by optimizing retrieval and verification processes, achieving 3-5x speedup with minimal performance loss.
Contribution
LASER presents a novel speculative decoding framework tailored for recommender systems, improving efficiency through customized retrieval and relaxed verification, with proven speedups and resource savings.
Findings
Achieves 3-5x speedup on public datasets.
Reduces computational resources by 67% during online testing.
Preserves downstream recommendation performance without loss.
Abstract
The past few years have witnessed a growing interest in LLM-based recommender systems (RSs), although their industrial deployment remains in a preliminary stage. Most existing deployments leverage LLMs offline as feature enhancers, generating augmented knowledge for downstream tasks. However, in recommendation scenarios with numerous users and items, even offline knowledge generation with LLMs demands significant time and computational resources. This inefficiency arises from the autoregressive nature of LLMs. A promising solution is speculative decoding, a Draft-Then-Verify approach that increases the number of tokens generated per decoding step. In this work, we first identify recommendation knowledge generation as a highly fitting use case for retrieval-based speculative decoding. Then, we discern its two characteristics: (1) the vast number of items and users in RSs leads to…
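The Draft-Then-Verify idea described above can be illustrated with a toy sketch of retrieval-based speculative decoding. This is not LASER's implementation: the function names (`retrieve_draft`, `speculative_decode`, `make_toy_target`), the suffix-matching retrieval, and the stand-in "target model" are all hypothetical and chosen only for illustration. Verification here is strict greedy matching, not the relaxed verification the paper proposes. In a real system the target LLM would score every draft position in one forward pass; the sketch simulates that by counting one step per verification round.

```python
from typing import Callable, List, Sequence, Tuple

Token = str


def retrieve_draft(context: Sequence[Token], pool: Sequence[Sequence[Token]],
                   match_len: int = 2, draft_len: int = 4) -> List[Token]:
    """Find the last `match_len` context tokens in the pool and return
    up to `draft_len` tokens that followed them (empty list if no match)."""
    key = list(context[-match_len:])
    if len(key) < match_len:
        return []
    for seq in pool:
        for i in range(len(seq) - match_len):
            if list(seq[i:i + match_len]) == key:
                return list(seq[i + match_len:i + match_len + draft_len])
    return []


def speculative_decode(prompt: Sequence[Token],
                       target_next: Callable[[List[Token]], Token],
                       pool: Sequence[Sequence[Token]],
                       max_new_tokens: int = 20,
                       eos: Token = "<eos>") -> Tuple[List[Token], int]:
    """Greedy draft-then-verify loop. Assumes a non-empty prompt.
    Returns the generated sequence and the number of decoding steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens and out[-1] != eos:
        draft = retrieve_draft(out, pool)
        steps += 1  # one simulated parallel pass of the target model
        accepted: List[Token] = []
        for tok in draft:
            # Accept a draft token only if the target would emit it too.
            if tok != target_next(out + accepted):
                break
            accepted.append(tok)
        # The target always contributes one token: a correction after a
        # rejected draft token, or a bonus token if the whole draft passed.
        accepted.append(target_next(out + accepted))
        for tok in accepted:
            out.append(tok)
            if tok == eos or len(out) - len(prompt) >= max_new_tokens:
                break
    return out, steps


def make_toy_target(reference: Sequence[Token]) -> Callable[[List[Token]], Token]:
    """Hypothetical stand-in for an LLM: greedily 'generates' `reference`."""
    def target_next(context: List[Token]) -> Token:
        return reference[len(context)] if len(context) < len(reference) else "<eos>"
    return target_next


reference = "the user likes sci-fi movies and the user likes space documentaries <eos>".split()
pool = ["the user likes sci-fi movies and games".split()]  # earlier generated text
prompt = ["the", "user"]

tokens, steps = speculative_decode(prompt, make_toy_target(reference), pool)
print(" ".join(tokens))
print(f"{len(tokens) - len(prompt)} new tokens in {steps} decoding steps")
```

Because a draft token is accepted only when it equals the target's own greedy choice, the output matches plain autoregressive decoding; the gain comes from accepting several retrieved tokens per step whenever the pool contains a matching continuation.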
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management
