MLPs are Efficient Distilled Generative Recommenders
Zitian Guo, Yupeng Hou, Clark Mingxuan Ju, Neil Shah, Julian McAuley

TL;DR
This paper introduces SID-MLP, a lightweight MLP-based distillation method that significantly accelerates generative recommendation models with minimal accuracy loss by replacing complex attention mechanisms with simple MLPs that capture global context.
Contribution
The authors propose a novel MLP-centric distillation framework for SID-based generative recommenders, reducing inference latency by over 8x while maintaining accuracy.
Findings
SID-MLP matches teacher model accuracy.
Inference is accelerated by 8.74x with SID-MLP.
SID-MLP++ extends the approach to also replace the encoder, further reducing latency.
Abstract
Generative recommendation (GR) models employing Semantic IDs (SIDs) exhibit strong potential, yet their practical deployment is bottlenecked by the high inference latency of beam-expanded autoregressive decoding. In this work, we identify that standard attention-heavy Transformer decoders are structural overkill for this task: the hierarchical nature of SIDs makes prediction difficulty drop sharply after the first token, rendering repeated attention computations highly redundant. Driven by this insight, we propose SID-MLP, a lightweight MLP-centric distillation framework that fundamentally simplifies the decoding paradigm for GR. Instead of executing complex, step-by-step attention mechanisms, our approach captures the global user context in a single operation, decoupled from sequential token prediction. We then distill the heavy autoregressive teacher into position-specific MLP…
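The idea in the abstract, a user context computed once and then reused by a separate MLP for each SID position, can be illustrated with a rough sketch. This is not the authors' implementation: the sizes `D`, `V`, `L`, `H`, the way each head is conditioned on previously chosen tokens, the greedy decoding (the paper refers to beam-expanded decoding), and the KL-based distillation loss are all assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
D = 16   # width of the user-context vector and of token embeddings
V = 32   # codebook size at each SID position
L = 3    # number of SID tokens per item
H = 64   # hidden width of each MLP head


def init_head(in_dim):
    """Randomly initialised two-layer MLP (untrained, for shape checking)."""
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, H)),
        "b1": np.zeros(H),
        "W2": rng.normal(0.0, 0.1, (H, V)),
        "b2": np.zeros(V),
    }


# One embedding table per SID position.
token_emb = rng.normal(0.0, 0.1, (L, V, D))
# Assumption: head t sees the context plus embeddings of the t earlier tokens.
heads = [init_head(D * (t + 1)) for t in range(L)]


def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def head_logits(t, context, prefix):
    """Logits over the codebook at position t; no attention is involved."""
    parts = [context] + [token_emb[i, tok] for i, tok in enumerate(prefix)]
    x = np.concatenate(parts)
    h = np.maximum(x @ heads[t]["W1"] + heads[t]["b1"], 0.0)  # ReLU
    return h @ heads[t]["W2"] + heads[t]["b2"]


def greedy_decode(context):
    """Greedy stand-in for beam search: the context is reused at every step."""
    prefix = []
    for t in range(L):
        prefix.append(int(np.argmax(head_logits(t, context, prefix))))
    return prefix


def distill_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one position, a common distillation loss."""
    lp_t = log_softmax(teacher_logits)
    lp_s = log_softmax(student_logits)
    return float(np.sum(np.exp(lp_t) * (lp_t - lp_s)))


# Stand-in for the encoded user history, computed a single time.
context = rng.normal(size=D)
sid = greedy_decode(context)
```

In this sketch the only per-step work is one small MLP forward pass, which is the kind of saving the abstract attributes to dropping repeated attention; in an actual distillation setup, `distill_loss` would be summed over positions with `teacher_logits` taken from the autoregressive teacher.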
