P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling
Pinyi Zhang, Ting-En Lin, Yuchuan Wu, Jingyang Chen, Zongqi Wang, Hua Yang, Ze Xu, Fei Huang, Kai Zhang, Yongbin Li

TL;DR
P-GenRM introduces a novel personalized reward model that uses test-time user-based scaling and clustering to improve alignment of language models with individual user preferences, achieving state-of-the-art results.
Contribution
It proposes P-GenRM, a personalized generative reward model with test-time user-based scaling, enabling better generalization and adaptation to individual user preferences.
Findings
Achieves 2.31% improvement on personalized reward benchmarks.
Demonstrates strong generalization on out-of-distribution data.
Test-time user-based scaling adds an extra 3% performance boost.
Abstract
Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) oversimplifying diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) struggling with generalization to new users with limited feedback. To this end, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across various scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales…
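The dual-granularity idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is cut off before it specifies the mechanism, so the sketch assumes that individual-level scaling averages several sampled scores for the same user, that prototype-level scaling borrows scores from the most similar User Prototype, and that the two are combined by a fixed weight. The names `cosine`, `nearest_prototype`, `scaled_score`, and `weight`, and all of the numbers, are hypothetical.

```python
import math
from statistics import mean


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_prototype(user_vec, prototypes):
    """Return the name of the prototype centroid most similar to the user vector."""
    return max(prototypes, key=lambda name: cosine(user_vec, prototypes[name]))


def scaled_score(individual_scores, prototype_scores, weight=0.5):
    """Blend two granularities (assumed scheme, not taken from the paper):
    the mean of scores sampled for the user, and the mean of scores
    obtained from the matched prototype, weighted by `weight`."""
    return (1 - weight) * mean(individual_scores) + weight * mean(prototype_scores)


# Hypothetical prototype centroids and user embedding.
prototypes = {"concise": [1.0, 0.0], "detailed": [0.0, 1.0]}
user_vec = [0.9, 0.1]

matched = nearest_prototype(user_vec, prototypes)

# Three sampled scores for this user, two scores from the matched prototype.
score = scaled_score([7.0, 8.0, 9.0], [6.0, 6.0], weight=0.25)
print(matched, score)
```

In this toy setup, a small `weight` keeps the final score close to the user's own sampled judgments, while a larger one leans on the prototype. Leaning on the prototype is the kind of fallback that might help for new users with limited feedback.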
Peer Reviews
Decision: ICLR 2026 Oral
This is generally a solid paper. Its strengths include:
1. The motivation is clear and important: personalized preference modeling is a real bottleneck for aligning LLM outputs to individual users.
2. Conceptually, the method advances online preference handling by turning user evidence into a contextual persona and rubric, and by scaling judgments at test time with both multiple runs for the same user and signals from similar users; this is a novel and well-argued design.
3. The approa
Some weaknesses remain, listed below.
1. The method in the main text is too abstract. Key I/O and losses are neither clearly written there nor clearly pointed to in the appendix. The algorithm, covering both training and test time, is generally complex, and I have some questions, listed in the question section.
2. Computation cost is underreported. I assume the additional test-time user-based scaling may cost much more than the baselines. A computation cost ablation study may be nec
1. This paper addresses an important problem of user personalization by providing a full pipeline that collects data, clusters users, refines the personalized reward model, and adapts the output.
2. The experimental evaluation is comprehensive.
3. The pipeline uses both implicit and explicit preference signals, making full use of the preference dataset.
1. The main concern is the limited novelty of the paper. The main contribution appears to be the overall pipeline for obtaining personalized outputs, assembled from existing methods such as generative reward models and user clustering. It is not very clear what the technical contributions are.
2. Lack of analysis of inference costs. Some analysis of the costs of personalization, including the costs of the baselines, would be helpful.
This paper has various strengths. In particular, I wish to point out:
- The results are promising at first look, especially Table 1, which makes a strong case for the proposed method.
- The experiments, especially the ablation experiments, are extensive and cover most of the questions I had while reading this paper.
- The overview figures (1 and 2) help the reader understand what is happening in the method.
I wish to point out some weaknesses on which the authors could improve to make a stronger case:
- **Missing error bars**: While Table 1 reports error bars (I assume this is the standard error?), no other experiment reports them, which makes it difficult to gauge the statistical significance of the results. In particular, the results in Table 3 and Figure 3b could have overlapping errors.
- **Composition of many methods**: While I appreciate the work
Taxonomy
Topics: Recommender Systems and Techniques · Persona Design and Applications · Machine Learning in Healthcare
