Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning

Chenghao Zhu; Meiling Tao; Tiannan Wang; Dongyi Ding; Yuchen Eleanor Jiang; Wangchunshu Zhou

arXiv:2510.18849 · cs.CL · October 22, 2025

TL;DR

This paper introduces Critique-Post-Edit reinforcement learning, a novel framework for faithful and controllable personalization of large language models, addressing reward hacking and improving alignment with individual user preferences.

Contribution

It proposes a new RL framework with a multi-dimensional reward model and critique-post-edit mechanism, enhancing personalization fidelity and controllability over existing methods.

Findings

01. Outperforms standard PPO on personalization benchmarks.

02. Qwen2.5-7B achieves 11% win-rate improvement.

03. Qwen2.5-14B surpasses GPT-4.1 in personalization.

Abstract

Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking, which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for…
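The two components named in the abstract can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the dimension names, the weights, the `GRMFeedback` type, and the stub functions `toy_generate`, `toy_grm`, and `toy_post_edit` are hypothetical and are not taken from the paper. A real system would call a learned reward model and the policy model instead of these stubs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical dimension names and weights (assumed for this sketch only).
WEIGHTS: Dict[str, float] = {
    "helpfulness": 0.4,
    "personalization": 0.4,
    "naturalness": 0.2,
}


@dataclass
class GRMFeedback:
    """Output of a generative reward model: per-dimension scores plus a critique."""
    scores: Dict[str, float]
    critique: str


def aggregate(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Collapse multi-dimensional scores into one scalar via a weighted average."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total


def critique_post_edit(
    prompt: str,
    persona: str,
    generate: Callable[[str, str], str],
    grm: Callable[[str, str, str], GRMFeedback],
    post_edit: Callable[[str, str, str, str], str],
) -> List[Tuple[str, float]]:
    """Score a draft, revise it using the critique, then score the revision."""
    draft = generate(prompt, persona)
    draft_feedback = grm(prompt, persona, draft)
    revised = post_edit(prompt, persona, draft, draft_feedback.critique)
    revised_feedback = grm(prompt, persona, revised)
    return [
        (draft, aggregate(draft_feedback.scores)),
        (revised, aggregate(revised_feedback.scores)),
    ]


# --- Toy stand-ins so the sketch runs end to end (all hypothetical) ---

def toy_generate(prompt: str, persona: str) -> str:
    return "Here is a generic answer."


def toy_grm(prompt: str, persona: str, response: str) -> GRMFeedback:
    personalized = persona.lower() in response.lower()
    scores = {
        "helpfulness": 6.0,
        "personalization": 8.0 if personalized else 3.0,
        "naturalness": 7.0,
    }
    critique = "" if personalized else f"Mention the user's interest in {persona}."
    return GRMFeedback(scores=scores, critique=critique)


def toy_post_edit(prompt: str, persona: str, draft: str, critique: str) -> str:
    if not critique:
        return draft
    return f"{draft} Since you enjoy {persona}, here is a tailored tip."


samples = critique_post_edit(
    "Suggest a weekend plan", "hiking", toy_generate, toy_grm, toy_post_edit
)
for text, reward in samples:
    print(f"{reward:.2f}  {text}")
```

The sketch returns both the draft and its revision with scalar rewards. How such pairs would be sampled into the reinforcement learning update is deliberately left out, since the abstract above does not specify it.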

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The GRM outputs both multi-dimensional scores and textual rationales, serving as a verifier that explains what to improve and why; this is novel and more informative than a single-scalar Bradley-Terry (BT) reward.
2. The post-edit stage yields a diverse, targeted learning signal, a novel way to mitigate reward hacking in standard PPO.
3. Clear, easy-to-follow visual illustration of the framework and training loop.
4. Addresses a limitation “one-size-fits-all” personas and shallow personalizatio

Weaknesses

1. Performance hinges on (1) the model’s ability to follow critiques, (2) the quality of GRM critiques, and (3) the consistency/calibration of generated scores; these are potential fragility points.
2. Although the GRM outputs multiple dimensions, optimization ultimately reduces them to a single scalar signal, potentially discarding nuance; moreover, only three dimensions are considered.
3. The GRM is trained with GPT-4o; behaviors could overfit to that judge, making the system hackable against its own

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The problem of dealing with reward hacking in personalization is relevant and the idea presented is practically useful and scalable.

Weaknesses

1. While using generative reward models instead of regular reward models to deal with reward hacking is interesting and of practical relevance, the novelty is somewhat weak. I recommend strengthening the paper with more significant contributions: one idea is to provide a theoretical guarantee; another is to stress-test the method to understand when it works well and when it does not.
2. The benchmarks with the naive DPO and RLHF are too simplistic. I recommend adding more benchmarks, f

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The integration of a GRM with textual critiques is persuasive, providing nuanced, multi-faceted feedback that mitigates common RL pitfalls like verbosity or superficial personalization. This builds effectively on the prior concept of generative verifiers but tailors it to personalization, showing clear empirical benefits in length-controlled evaluations.
- The use of length-debiased metrics (Dubois et al., 2024) and multiple benchmarks ensures fair comparisons, avoiding common biases in LLM-as

Weaknesses

- Overall, missing details in many parts of the paper make the context hard to follow. For example, Section 3, which builds the core motivation for the proposed approach, lacks any details about the specific experimental setup; it abruptly starts from "... we train with 18k samples" without stating the provenance of the data or the goal of this training in the first place. In Section 4.1, although the section is devoted to the GRM, it only mentions the construction of training data but does not provide any

Reviewer 04 · Rating 4 · Confidence 4

Strengths

- The authors present comprehensive experiments across multiple datasets and scales, showing consistent gains and strong performance even compared to GPT-4.1.
- The length-controlled evaluation and human validation strengthen the credibility of the reported results.
- Ablation and sampling-strategy studies are detailed and help disentangle contributions between the GRM and data collection.

Weaknesses

- The reward aggregation uses fixed weights, but no sensitivity or robustness analysis is provided. Since these weights directly control the reward, understanding how performance varies with them is essential.
- The paper insufficiently situates itself within existing work:
  - From the methodological side, training models on post-critique edits has been extensively explored before (see [1] and many follow-ups), yet this lineage is not discussed.
  - From the application side, no comparison i
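To make the first point above concrete, the sketch below shows one way a weight-sensitivity check could look: sweep the aggregation weights over a coarse grid and record which candidate response receives the highest scalar reward at each setting. This is an illustrative assumption, not an analysis from the paper; the dimension names, the candidate scores, and the helper names `simplex_grid`, `weighted_reward`, and `best_response_by_weights` are all hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical dimension names (assumed for this sketch only).
DIMENSIONS = ("helpfulness", "personalization", "naturalness")

Weights = Tuple[float, float, float]


def simplex_grid(steps: int = 4) -> List[Weights]:
    """All non-negative weight triples on a grid of 1/steps that sum to 1."""
    grid: List[Weights] = []
    for i, j in product(range(steps + 1), repeat=2):
        if i + j <= steps:
            k = steps - i - j
            grid.append((i / steps, j / steps, k / steps))
    return grid


def weighted_reward(scores: Dict[str, float], weights: Weights) -> float:
    """Fixed-weight aggregation of per-dimension scores into a scalar reward."""
    return sum(w * scores[d] for w, d in zip(weights, DIMENSIONS))


def best_response_by_weights(
    candidates: Dict[str, Dict[str, float]], steps: int = 4
) -> Dict[Weights, str]:
    """For each weight setting, name the candidate with the highest reward."""
    winners: Dict[Weights, str] = {}
    for weights in simplex_grid(steps):
        winners[weights] = max(
            candidates, key=lambda name: weighted_reward(candidates[name], weights)
        )
    return winners


# Made-up scores for two hypothetical responses to the same prompt.
candidates = {
    "verbose": {"helpfulness": 8.0, "personalization": 3.0, "naturalness": 5.0},
    "tailored": {"helpfulness": 6.0, "personalization": 8.0, "naturalness": 7.0},
}

winners = best_response_by_weights(candidates)
for weights, name in winners.items():
    print(weights, "->", name)
```

If the preferred response changes across a large share of the grid, the scalar reward is sensitive to the weight choice, which is the kind of evidence the reviewer asks for.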

Taxonomy

Topics: Recommender Systems and Techniques · Topic Modeling · Machine Learning in Healthcare