Learning to summarize user information for personalized reinforcement learning from human feedback

Hyunji Nam; Yanming Wan; Mickel Liu; Peter Ahnn; Jianxun Lian; Natasha Jaques

arXiv:2507.13579·cs.LG·February 6, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces PLUS, a framework that personalizes reinforcement learning for language models by summarizing individual user preferences, leading to improved accuracy, robustness, and interpretability in user-specific responses.

Contribution

PLUS is a novel method that learns user preference summaries to condition reward models, enabling effective zero-shot personalization and better alignment with individual user goals.

Findings

1. Achieves an 11-77% improvement in reward model accuracy.
2. Attains 25% better performance than existing personalized RLHF methods.
3. Reaches a 72% win rate in zero-shot personalization with GPT-4.

Abstract

As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align to different users' preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users, as it models the entire user population with a single reward model, meaning it assumes that everyone's preferences are the same. We present a novel framework, Preference Learning Using Summarization (PLUS), that uses reinforcement learning (RL) to learn to produce text-based summaries of each user's preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. Both the user-summarization…
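The idea of conditioning a reward model on a text summary can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Bradley-Terry preference model is assumed here because it is the usual choice in RLHF, and `toy_reward`, `preference_probability`, and `pairwise_accuracy` are hypothetical names. The word-overlap scorer is only a stand-in for a trained neural reward model.

```python
import math


def toy_reward(summary: str, prompt: str, response: str) -> float:
    """Hypothetical stand-in for a learned reward r(summary, prompt, response).

    Scores a response by the number of distinct words it shares with the
    user summary. The prompt is accepted but ignored by this toy scorer.
    """
    summary_words = set(summary.lower().split())
    response_words = set(response.lower().split())
    return float(len(summary_words & response_words))


def preference_probability(summary, prompt, response_a, response_b,
                           reward_fn=toy_reward):
    """Bradley-Terry probability that the user prefers response_a (assumed model)."""
    diff = (reward_fn(summary, prompt, response_a)
            - reward_fn(summary, prompt, response_b))
    return 1.0 / (1.0 + math.exp(-diff))


def pairwise_accuracy(examples, reward_fn=toy_reward):
    """Fraction of (summary, prompt, chosen, rejected) tuples ranked correctly."""
    correct = sum(
        reward_fn(s, p, chosen) > reward_fn(s, p, rejected)
        for s, p, chosen, rejected in examples
    )
    return correct / len(examples)
```

Because the summary is an input to the reward function, the same pair of responses can be ranked differently for two users, which a single population-level reward model cannot do. `pairwise_accuracy` corresponds to the usual notion of reward model accuracy on held-out preference pairs.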

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* Adapting to a user's individual preference is an important problem
* The method is well motivated and very reasonable
* The results show a notable improvement in RM accuracy

Weaknesses

* The evaluation is limited to reward modeling; it was not tested whether the trained reward model can be used in a subsequent step to train a better policy. Recent research [1,2] has shown that higher RM accuracy does not necessarily lead to better performance after RL training. This is my main issue with this paper. The abstract also claims a "25% improvement over the best personalized RLHF technique". This is incorrect; at most, the paper can claim such an improvement over the "best perso

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. Representing user preferences via natural-language summaries makes the user representation easy to understand and inherently interpretable, while avoiding issues with overly long raw contexts.
2. PLUS is evaluated across diverse datasets and settings (including OOD), providing evidence of its effectiveness.
3. PLUS achieves improvements over baselines in most settings and even surpasses strong baselines such as GPT-4o and GPT-4.1.

Weaknesses

1. *Data sparsity per user.* With only a few dialogs for each user (e.g., Pets uses 3, UltraFeedback uses 2–4, PRISM uses 3), can the reward model learn a reliable representation of that user’s preferences? Is the reward model potentially *under-converged* in such low-data regimes?
2. *Training stability.* Alternating optimization can be hard to converge. It would be important to provide *training curves for the reward model* to assess stability (e.g., accuracy, variance across steps).

Reviewer 03 · Rating 8 · Confidence 4

Strengths

This paper provides strong results showing that PLUS personalizes more effectively. The authors show results on the PRISM dataset, a diverse, multicultural dataset that reflects what an LLM may encounter in the real world. Furthermore, PLUS shows robustness to new topics and users, which is more difficult for approaches that, for example, assign a new user an ID rather than a language-based summary.

Weaknesses

The authors could add some important details related to training, such as convergence criteria. Furthermore, PPO in a MARL setup is highly sensitive to learning rate and other parameters. The authors should discuss the possibility of instability in training in the limitations section. It would be interesting if the authors provided quantitative results to back up their claim that the personal embeddings don't capture information as well as the language summary - perhaps via mutual information meas

Taxonomy

Topics: Evolutionary Algorithms and Applications · Muscle activation and electromyography studies