Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding

Yunjia Xi; Hangyu Wang; Bo Chen; Jianghao Lin; Menghui Zhu; Weiwen Liu; Ruiming Tang; Zhewei Wei; Weinan Zhang; Yong Yu

arXiv:2408.05676·cs.IR·April 30, 2025

TL;DR

This paper introduces LASER, a speculative decoding method that accelerates LLM-based recommender systems through customized retrieval and relaxed verification, achieving a 3-5x speedup while preserving downstream recommendation performance.

Contribution

LASER presents a novel speculative decoding framework tailored for recommender systems, improving efficiency through customized retrieval and relaxed verification, with proven speedups and resource savings.

Findings

01. Achieves a 3-5x speedup on public datasets.

02. Reduces computational resources by 67% during online testing.

03. Maintains lossless downstream recommendation performance.

Abstract

The past few years have witnessed a growing interest in LLM-based recommender systems (RSs), although their industrial deployment remains in a preliminary stage. Most existing deployments leverage LLMs offline as feature enhancers, generating augmented knowledge for downstream tasks. However, in recommendation scenarios with numerous users and items, even offline knowledge generation with LLMs demands significant time and computational resources. This inefficiency arises from the autoregressive nature of LLMs. A promising solution is speculative decoding, a Draft-Then-Verify approach that increases the number of tokens generated per decoding step. In this work, we first identify recommendation knowledge generation as a highly fitting use case for retrieval-based speculative decoding. Then, we discern its two characteristics: (1) the vast number of items and users in RSs leads to…
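
To make the Draft-Then-Verify idea in the abstract concrete, here is a minimal sketch of generic retrieval-based speculative decoding with strict greedy verification. It is not LASER's customized retrieval or relaxed verification. Every name in it (`retrieve_draft`, `speculative_decode`, `toy_next_token`, `TARGET_TEXT`, `POOL`) is hypothetical, and the "target model" is a toy stand-in that replays a fixed sentence. A real system would check all draft tokens in one batched forward pass; the sketch simulates that by counting one verification step per loop iteration.

```python
from typing import Callable, List, Sequence, Tuple

Token = str


def retrieve_draft(context: Sequence[Token], pool: Sequence[Sequence[Token]],
                   max_draft: int = 4, max_match: int = 3) -> List[Token]:
    """Find the longest suffix of `context` that occurs in `pool` and
    return the tokens that followed it there (the draft)."""
    for n in range(min(max_match, len(context)), 0, -1):
        suffix = list(context[-n:])
        for doc in pool:
            # stop early enough that at least one token follows the match
            for i in range(len(doc) - n):
                if list(doc[i:i + n]) == suffix:
                    return list(doc[i + n:i + n + max_draft])
    return []  # nothing retrieved: fall back to plain one-token decoding


def speculative_decode(prompt: Sequence[Token],
                       next_token: Callable[[List[Token]], Token],
                       pool: Sequence[Sequence[Token]],
                       max_new_tokens: int,
                       eos: Token = "<eos>") -> Tuple[List[Token], int]:
    """Greedy draft-then-verify loop. Returns (generated tokens, steps)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = retrieve_draft(out, pool)
        steps += 1  # one simulated parallel verification pass
        accepted: List[Token] = []
        for tok in draft:
            # strict verification: keep a draft token only if the target
            # model would have produced exactly that token here
            if next_token(out + accepted) != tok:
                break
            accepted.append(tok)
        # the target's own token at the first mismatch (or after the draft)
        accepted.append(next_token(out + accepted))
        for tok in accepted:
            out.append(tok)
            if tok == eos or len(out) - len(prompt) >= max_new_tokens:
                return out[len(prompt):], steps
    return out[len(prompt):], steps


# Toy stand-in for an LLM (assumption): replays a fixed sentence by position.
TARGET_TEXT = ("the user likes sci-fi movies and "
               "the user likes space operas <eos>").split()


def toy_next_token(prefix: List[Token]) -> Token:
    return TARGET_TEXT[len(prefix)] if len(prefix) < len(TARGET_TEXT) else "<eos>"


# Hypothetical retrieval pool of previously generated knowledge snippets.
POOL = [
    "the user likes sci-fi movies".split(),
    "the user likes space operas".split(),
]

if __name__ == "__main__":
    tokens, steps = speculative_decode(["the"], toy_next_token, POOL, 20)
    print(len(tokens), "tokens in", steps, "verification steps")
```

In this toy run the 11 generated tokens take 4 verification steps instead of 11 one-token steps, and the output is identical to plain greedy decoding because a draft token is accepted only when it matches the target's own choice.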

Code & Models

Repositories

yunjiaxi/laser (PyTorch, official)

Taxonomy

Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management