Robust Post-Training for Generative Recommenders: Why Exponential Reward-Weighted SFT Outperforms RLHF
Keertana Chidambaram, Sanath Kumar Krishnamurthy, Qiuling Xu, Ko-Jen Hsiao, Moumita Bhattacharya

TL;DR
This paper proposes exponential reward-weighted supervised fine-tuning (SFT) for post-training generative recommenders. The method is fully offline, robust to noisy rewards, and backed by policy improvement guarantees, and the authors report that it outperforms RLHF on recommendation quality at scale.
Contribution
The paper demonstrates that exponential reward-weighted SFT is theoretically sound and empirically superior for post-training generative recommenders, addressing limitations of existing methods.
Findings
Exponential reward weighting outperforms RLHF in recommendation tasks.
The method is robust to noisy rewards and does not require propensity scores.
Experiments show consistent improvements across datasets.
Abstract
Aligning generative recommender systems to user preferences via post-training is critical for closing the gap between next-item prediction and actual recommendation quality. Existing post-training methods are ill-suited for production-scale systems: RLHF methods reward hack due to noisy user feedback and unreliable reward models, offline RL alternatives require propensity scores that are unavailable, and online interaction is infeasible. We identify exponential reward-weighted SFT, whose per-example weights grow exponentially with the observed reward, as uniquely suited to this setting, and provide the theoretical and empirical foundations that explain why. By optimizing directly on observed rewards without querying a learned reward model, the method is immune to reward hacking, requires no propensity scores, and is fully offline. We prove the first policy improvement guarantees for this setting under noisy rewards,…
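The idea of weighting logged examples by an exponential of their observed reward can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `temperature` parameter, and the normalization of weights to mean 1 are assumptions made here for illustration, and the per-example log-probabilities stand in for whatever a generative recommender would assign to the logged next item.

```python
import math


def exp_reward_weights(rewards, temperature=1.0):
    """Return weights proportional to exp(r / temperature), scaled to mean 1.

    `temperature` is a hypothetical knob, not a name taken from the paper.
    Subtracting the max reward before exponentiating avoids overflow and
    does not change the normalized weights.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    r_max = max(rewards)
    unnormalized = [math.exp((r - r_max) / temperature) for r in rewards]
    total = sum(unnormalized)
    n = len(rewards)
    return [n * u / total for u in unnormalized]


def weighted_sft_loss(log_probs, rewards, temperature=1.0):
    """Reward-weighted negative log-likelihood over a batch of logged examples.

    `log_probs[i]` is the model's log-probability of the logged item i and
    `rewards[i]` is the observed (possibly noisy) reward for that item.
    No reward model and no propensity scores are used.
    """
    if len(log_probs) != len(rewards):
        raise ValueError("log_probs and rewards must have the same length")
    weights = exp_reward_weights(rewards, temperature)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


if __name__ == "__main__":
    # Toy batch: three logged interactions with made-up values.
    toy_log_probs = [-1.2, -0.7, -2.0]
    toy_rewards = [0.0, 1.0, 0.5]
    print(exp_reward_weights(toy_rewards, temperature=0.5))
    print(weighted_sft_loss(toy_log_probs, toy_rewards, temperature=0.5))
```

In this sketch, a large `temperature` pushes all weights toward 1, which recovers plain SFT, while a small `temperature` concentrates the loss on the highest-reward examples. With equal rewards the loss reduces to the ordinary mean negative log-likelihood.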
Taxonomy
Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Advanced Bandit Algorithms Research
