Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target
Taesan Kim, Hyeongjun Yun, Jaegul Choo, Chung Park

TL;DR
This paper introduces Anchored Bandit Policy Optimization (ABPO), a framework for continual updates of LLM-based recommenders. It addresses exposure bias by anchoring policy optimization to the logged recommendation, and feedback ambiguity by treating feedback according to its confidence.
Contribution
The paper proposes ABPO, a new method combining group-relative policy optimization with explicit bias correction and feedback reliability treatment for continual LLM recommender updates.
Findings
ABPO improves recommendation accuracy across five domains.
It mitigates exposure bias more effectively than prior methods.
It uses confidence signals to handle ambiguous feedback, in particular no-responses.
Abstract
Generative LLM-based recommenders (LLM-Rec) require continual post-deployment updates, yet deployment logs provide only policy-shaped contextual bandit feedback: outcomes are observed solely for items exposed by a prior serving policy, inducing exposure bias and yielding partial, asymmetric signals consisting of relatively reliable positive responses and ambiguous no-responses. We propose an Anchored Bandit Policy Optimization (ABPO) framework for continual LLM-Rec updates that combines group-relative policy optimization (GRPO) with explicit treatment of exposure bias and feedback ambiguity. Specifically, we insert the exposed recommendation as a logged anchor into each GRPO rollout group, so that group-relative normalization is calibrated against the action actually exposed by the prior policy rather than against newly sampled rollouts alone. Because both positive- and no-responses are…
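The anchoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes scalar rewards, the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation), and a hypothetical reward of 1.0 for a logged positive response. The function names and the way rewards are assigned to sampled rollouts are illustrative assumptions. The confidence-based treatment of feedback is omitted because the abstract is cut off before it is described.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def anchored_advantages(rollout_rewards, anchor_reward, eps=1e-8):
    """Append the logged (exposed) recommendation to the rollout group
    before normalizing, so the baseline includes the action the prior
    serving policy actually exposed. The anchor is the last element.
    (Hypothetical helper; reward values are assumptions.)"""
    group = list(rollout_rewards) + [anchor_reward]
    return group_relative_advantages(group, eps)


# Three sampled rollouts with no reward signal, plus one logged anchor
# that received a positive response (assumed reward 1.0).
rollouts = [0.0, 0.0, 0.0]
plain = group_relative_advantages(rollouts)
anchored = anchored_advantages(rollouts, anchor_reward=1.0)
```

In this toy case the unanchored group has zero variance, so every advantage is zero and the update carries no signal. With the anchor included, the logged positive gets a positive advantage and the sampled rollouts get negative ones.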
