Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation
Jiaju Chen, Chongming Gao, Chenxiao Fan, Haoyan Liu, Qingpeng Cai, Peng Jiang, and Xiangnan He

TL;DR
The paper introduces PAD-Rec, a position-aware drafting module for speculative decoding in LLM-based generative list-wise recommendation. It reports up to 3.1x faster inference while largely preserving recommendation quality.
Contribution
It proposes a lightweight, trainable module that encodes item position and draft step information to better model uncertainty and structural cues in generative recommendation.
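To make the two signals concrete, the sketch below computes a within-item slot index and a draft-step index for each drafted token. It is an illustration only, not the authors' implementation: the function name `position_features`, the fixed `tokens_per_item` layout, and the reading of "item position" as the within-item slot described in the abstract are all assumptions.

```python
from typing import List, Tuple


def position_features(
    prefix_len: int, num_draft: int, tokens_per_item: int
) -> List[Tuple[int, int]]:
    """Return (within_item_slot, draft_step) for each drafted token.

    Assumption (hypothetical layout): every item occupies exactly
    `tokens_per_item` consecutive tokens, counting any separator token,
    and the item sequence starts at absolute position 0.
    """
    features = []
    for step in range(num_draft):
        absolute_pos = prefix_len + step
        # Slot: where the token sits inside its item (0 = first ID token).
        slot = absolute_pos % tokens_per_item
        # Draft step: how deep into the current speculation round we are.
        features.append((slot, step))
    return features


# Six tokens already generated, five drafted, four tokens per item.
print(position_features(prefix_len=6, num_draft=5, tokens_per_item=4))
# prints [(2, 0), (3, 1), (0, 2), (1, 3), (2, 4)]
```

In a trainable module, indices like these would typically select learned embeddings that are added to the draft model's inputs; how PAD-Rec combines them is not specified in this summary.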
Findings
PAD-Rec achieves up to a 3.1x inference speedup.
It adds about 5% average speedup on top of strong baselines.
Largely preserves recommendation quality despite acceleration.
Abstract
Large language model (LLM)-based generative list-wise recommendation has advanced rapidly, but decoding remains sequential and thus latency-prone. To accelerate inference without changing the target distribution, speculative decoding (SD) uses a small draft model to propose several next tokens at once and a target LLM to verify and accept the longest prefix, skipping multiple steps per round. In generative recommendation, however, each item is represented by multiple semantic-ID tokens, often with separators, and current drafts typically treat these tokens uniformly. This overlooks two practical facts: (i) a token's semantics depend on its within-item slot, and (ii) uncertainty tends to increase with speculation depth. Without modeling these effects, SD's speedups can be limited. We introduce PAD-Rec, Position-Aware Drafting for generative Recommendation, a lightweight module that…
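The draft-then-verify loop described in the abstract can be sketched as follows. This is a minimal illustration of generic speculative decoding under greedy decoding, not PAD-Rec itself: `speculative_step`, `generate`, and the toy models are hypothetical names, and the verification loop calls the target once per position for clarity, whereas a real system scores all drafted positions in a single batched forward pass.

```python
from typing import Callable, List, Sequence

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]


def speculative_step(
    prefix: List[Token], draft_next: NextTokenFn, target_next: NextTokenFn, depth: int
) -> List[Token]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # Draft: the small model proposes `depth` tokens autoregressively.
    drafted: List[Token] = []
    for _ in range(depth):
        drafted.append(draft_next(prefix + drafted))

    # Verify: keep the longest prefix of the draft that the target agrees with.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's correction ends the round
            return accepted
        accepted.append(tok)

    # Every drafted token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(
    prefix: List[Token], draft_next: NextTokenFn, target_next: NextTokenFn,
    depth: int, max_new: int,
) -> List[Token]:
    """Repeat speculative rounds until `max_new` tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, depth))
    return out[: len(prefix) + max_new]


# Toy models (assumptions): the target counts upward modulo 10,
# and the draft agrees except right after token 4.
def toy_target(seq: Sequence[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_draft(seq: Sequence[Token]) -> Token:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


print(speculative_step([1], toy_draft, toy_target, depth=3))  # prints [2, 3, 4, 5]
print(speculative_step([3], toy_draft, toy_target, depth=3))  # prints [4, 5]
```

Each round yields at least one token and at most `depth + 1`, so the output matches plain greedy decoding with the target while needing fewer target rounds when the draft is accurate. With sampling instead of greedy decoding, the exact-match check is replaced by a rejection-sampling rule to keep the target distribution unchanged; that case is not covered here.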
