Robust Post-Training for Generative Recommenders: Why Exponential Reward-Weighted SFT Outperforms RLHF

Keertana Chidambaram; Sanath Kumar Krishnamurthy; Qiuling Xu; Ko-Jen Hsiao; Moumita Bhattacharya

arXiv:2603.10279·cs.LG·March 12, 2026

Open Access

TL;DR

This paper identifies exponential reward-weighted supervised fine-tuning (SFT) as a post-training method well suited to generative recommenders. It outperforms RLHF while remaining fully offline, robust to noisy rewards, and theoretically justified, improving recommendation quality at scale.

Contribution

The paper demonstrates that exponential reward-weighted SFT is theoretically sound and empirically superior for post-training generative recommenders, addressing limitations of existing methods.

Findings

1. Exponential reward weighting outperforms RLHF in recommendation tasks.

2. The method is robust to noisy rewards and does not require propensity scores.

3. Experiments show consistent improvements across datasets.

Abstract

Aligning generative recommender systems to user preferences via post-training is critical for closing the gap between next-item prediction and actual recommendation quality. Existing post-training methods are ill-suited for production-scale systems: RLHF methods reward hack due to noisy user feedback and unreliable reward models, offline RL alternatives require propensity scores that are unavailable, and online interaction is infeasible. We identify exponential reward-weighted SFT with weights $w = \exp(r / \lambda)$ as uniquely suited to this setting, and provide the theoretical and empirical foundations that explain why. By optimizing directly on observed rewards without querying a learned reward model, the method is immune to reward hacking, requires no propensity scores, and is fully offline. We prove the first policy improvement guarantees for this setting under noisy rewards,…
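
As a rough illustration of the weighting the abstract describes, the sketch below computes weights proportional to exp(r / λ) from logged rewards and applies them to per-example log-likelihoods. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the max-shift for numerical stability, and the normalization of weights to mean 1 are all choices made here for illustration. Rescaling every weight by the same constant only rescales the loss.

```python
import math


def exp_reward_weights(rewards, lam):
    """Return weights proportional to exp(r / lam), one per logged example.

    Assumption (not from the paper): weights are max-shifted before
    exponentiation to avoid overflow, then normalized to have mean 1.
    """
    if lam <= 0:
        raise ValueError("lam (the temperature) must be positive")
    scaled = [r / lam for r in rewards]
    shift = max(scaled)
    unnormalized = [math.exp(s - shift) for s in scaled]
    mean_weight = sum(unnormalized) / len(unnormalized)
    return [u / mean_weight for u in unnormalized]


def weighted_sft_loss(log_probs, rewards, lam):
    """Reward-weighted negative log-likelihood over a batch.

    log_probs[i] is the model's log-probability of the logged item i,
    rewards[i] is the observed reward for that item (no reward model).
    """
    weights = exp_reward_weights(rewards, lam)
    total = sum(w * lp for w, lp in zip(weights, log_probs))
    return -total / len(log_probs)


# Hypothetical batch: three logged interactions with observed rewards.
example_log_probs = [-1.2, -0.7, -2.3]
example_rewards = [0.0, 1.0, 0.5]
example_loss = weighted_sft_loss(example_log_probs, example_rewards, lam=0.5)
```

In this sketch a smaller λ concentrates the weight on the highest-reward examples, while a large λ makes the weights nearly uniform, so the loss approaches ordinary SFT.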

Taxonomy

Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Advanced Bandit Algorithms Research