Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation

Jiaju Chen, Chongming Gao, Chenxiao Fan, Haoyan Liu, Qingpeng Cai, Peng Jiang, and Xiangnan He

arXiv:2604.27747·cs.IR·May 1, 2026

TL;DR

The paper introduces PAD-Rec, a position-aware drafting module for speculative decoding in LLM-based generative list-wise recommendation. It reports up to 3.1x faster inference while largely preserving recommendation quality.

Contribution

The paper proposes a lightweight, trainable module that encodes each token's within-item slot and its draft step, so that the draft model can account for slot-dependent token semantics and for uncertainty that tends to grow with speculation depth.
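To make the two signals concrete, here is a minimal sketch of how a within-item slot index and a draft-step index could be computed for each drafted token and mixed into a draft hidden state. The summary does not say how PAD-Rec encodes these signals, so everything here is an assumption for illustration: the fixed number of tokens per item (separator included), the additive embedding tables, and the names `slot_and_step`, `make_table`, and `add_position_signals`.

```python
import random
from typing import List, Tuple


def slot_and_step(num_generated: int, draft_len: int,
                  tokens_per_item: int) -> List[Tuple[int, int]]:
    """Return (within-item slot, draft step) for each token in one draft round.

    Assumes every item uses exactly `tokens_per_item` tokens (semantic-ID
    tokens plus a separator), so the slot is position modulo item length.
    `num_generated` counts output tokens produced before this round.
    """
    return [((num_generated + k) % tokens_per_item, k) for k in range(draft_len)]


def make_table(rows: int, dim: int, seed: int) -> List[List[float]]:
    """Hypothetical embedding table; in practice these values would be trained."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def add_position_signals(hidden: List[float], slot: int, step: int,
                         slot_table: List[List[float]],
                         step_table: List[List[float]]) -> List[float]:
    """Add slot and step embeddings to a draft hidden state (assumed additive)."""
    return [h + s + t for h, s, t in zip(hidden, slot_table[slot], step_table[step])]


# Example: 6 tokens already generated, 5 drafted tokens, 4 tokens per item.
features = slot_and_step(num_generated=6, draft_len=5, tokens_per_item=4)
slot_table = make_table(rows=4, dim=8, seed=0)
step_table = make_table(rows=5, dim=8, seed=1)
hidden = [0.0] * 8
conditioned = [add_position_signals(hidden, slot, step, slot_table, step_table)
               for slot, step in features]
```

In this example the drafted tokens start in the middle of an item, so the slot index wraps around when the draft crosses an item boundary, while the step index keeps increasing with speculation depth.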

Findings

01. Achieves up to 3.1x speedup in inference time.

02. Provides about 5% average speedup over strong baselines.

03. Largely preserves recommendation quality despite acceleration.

Abstract

Large language model (LLM)-based generative list-wise recommendation has advanced rapidly, but decoding remains sequential and thus latency-prone. To accelerate inference without changing the target distribution, speculative decoding (SD) uses a small draft model to propose several next tokens at once and a target LLM to verify and accept the longest prefix, skipping multiple steps per round. In generative recommendation, however, each item is represented by multiple semantic-ID tokens, often with separators, and current drafts typically treat these tokens uniformly. This overlooks two practical facts: (i) a token's semantics depend on its within-item slot, and (ii) uncertainty tends to increase with speculation depth. Without modeling these effects, SD's speedups can be limited. We introduce PAD-Rec, Position-Aware Drafting for generative Recommendation, a lightweight module that…
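The draft-then-verify loop described in the abstract can be sketched as follows. This is the greedy variant of speculative decoding with toy deterministic models, not the paper's implementation: the draft proposes `draft_len` tokens, the target keeps the longest prefix that matches its own greedy choices, and then contributes one corrected or bonus token. The toy `target_next` and `draft_next` functions and all parameter names are assumptions for illustration. A real system would verify the whole proposal in a single batched target forward pass, and sampling-based SD uses a rejection step rather than exact matching.

```python
from typing import Callable, List, Tuple

NextTokenFn = Callable[[List[int]], int]  # greedy next-token function


def target_next(context: List[int]) -> int:
    """Toy stand-in for the target LLM's greedy next token."""
    return (context[-1] * 3 + len(context)) % 7


def draft_next(context: List[int]) -> int:
    """Toy draft model: agrees with the target except at every 5th position."""
    token = target_next(context)
    return (token + 1) % 7 if len(context) % 5 == 0 else token


def autoregressive_decode(target: NextTokenFn, prompt: List[int],
                          num_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(num_new_tokens):
        seq.append(target(seq))
    return seq


def speculative_decode(target: NextTokenFn, draft: NextTokenFn,
                       prompt: List[int], num_new_tokens: int,
                       draft_len: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (sequence, number of rounds)."""
    seq = list(prompt)
    goal = len(prompt) + num_new_tokens
    rounds = 0
    while len(seq) < goal:
        rounds += 1
        # Draft phase: propose several tokens cheaply.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft(seq + proposal))
        # Verify phase: accept the longest prefix the target agrees with.
        accepted: List[int] = []
        for token in proposal:
            expected = target(seq + accepted)
            if token != expected:
                accepted.append(expected)  # replace first mismatch, then stop
                break
            accepted.append(token)
        else:
            accepted.append(target(seq + accepted))  # bonus token if all accepted
        seq.extend(accepted)
    return seq[:goal], rounds


prompt = [1, 2]
slow = autoregressive_decode(target_next, prompt, num_new_tokens=12)
fast, rounds = speculative_decode(target_next, draft_next, prompt, num_new_tokens=12)
```

Because every accepted token equals the target's own greedy choice, the output matches plain autoregressive decoding while using fewer verification rounds than generated tokens. The speedup therefore depends on how many drafted tokens survive verification, which is the quantity a better draft model aims to raise.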
