Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
Chenghao Zhu, Meiling Tao, Tiannan Wang, Dongyi Ding, Yuchen Eleanor Jiang, Wangchunshu Zhou

TL;DR
This paper introduces Critique-Post-Edit reinforcement learning, a novel framework for faithful and controllable personalization of large language models, addressing reward hacking and improving alignment with individual user preferences.
Contribution
It proposes a new RL framework with a multi-dimensional reward model and a critique-post-edit mechanism, improving personalization fidelity and controllability over existing methods.
Findings
Outperforms standard PPO on personalization benchmarks.
Qwen2.5-7B achieves 11% win-rate improvement.
Qwen2.5-14B surpasses GPT-4.1 in personalization.
Abstract
Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking, which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for…
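The two components described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the `Feedback` type, the three dimension names and their weights, the helper names (`aggregate`, `critique_post_edit_step`), and the toy `toy_generate`/`toy_judge`/`toy_edit` stand-ins for the policy model and the GRM are all hypothetical. The sketch only shows the data flow: a draft is scored and critiqued, the policy revises it using the critique, and both versions are returned with a scalar reward. How these samples enter the RL update is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Feedback:
    scores: Dict[str, float]  # per-dimension scores from the reward model
    critique: str             # textual critique: what to improve and why


# Hypothetical dimension names and fixed weights (assumptions, not from the paper).
WEIGHTS = {"helpfulness": 0.4, "personalization": 0.4, "naturalness": 0.2}


def aggregate(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Collapse multi-dimensional scores into one scalar via a weighted mean."""
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total


def critique_post_edit_step(
    prompt: str,
    generate: Callable[[str], str],
    judge: Callable[[str, str], Feedback],
    edit: Callable[[str, str, str], str],
) -> List[Tuple[str, float]]:
    """Score a draft, revise it using the critique, and score the revision."""
    draft = generate(prompt)
    fb_draft = judge(prompt, draft)
    revised = edit(prompt, draft, fb_draft.critique)
    fb_revised = judge(prompt, revised)
    return [
        (draft, aggregate(fb_draft.scores)),
        (revised, aggregate(fb_revised.scores)),
    ]


# Toy stand-ins so the sketch runs without any model (purely illustrative).
def toy_generate(prompt: str) -> str:
    return "Here is a generic answer."


def toy_judge(prompt: str, response: str) -> Feedback:
    # Pretend the user's persona says they like hiking.
    personalized = 1.0 if "hiking" in response.lower() else 0.0
    return Feedback(
        scores={"helpfulness": 0.5, "personalization": personalized, "naturalness": 0.5},
        critique="Mention the user's interest in hiking.",
    )


def toy_edit(prompt: str, draft: str, critique: str) -> str:
    # A real policy would condition on the critique; this stub appends a fixed fix.
    return draft + " Since you enjoy hiking, try a trail nearby."


if __name__ == "__main__":
    for text, reward in critique_post_edit_step(
        "Plan my weekend.", toy_generate, toy_judge, toy_edit
    ):
        print(f"{reward:.2f}  {text}")
```

With the toy stand-ins, the revised response receives a higher aggregated reward than the draft because it addresses the critique. Note that `aggregate` reduces the per-dimension scores to a single scalar with fixed weights, so the result depends directly on the chosen weights.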
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Using a GRM that outputs both multi-dimensional scores and textual rationales is novel: it serves as a verifier that explains what to improve and why, which is more informative than a single-scalar BT reward.
2. Having the post-edit stage yield a diverse, targeted learning signal is a novel idea for mitigating reward hacking in standard PPO.
3. Clear, easy-to-follow visual illustration of the framework and training loop.
4. Addresses a limitation “one-size-fits-all” personas and shallow personalizatio
1. Performance hinges on (1) the model’s ability to follow critiques, (2) the quality of GRM critiques, and (3) the consistency/calibration of generated scores; these are potential fragility points.
2. Although the GRM outputs multiple dimensions, optimization ultimately reduces them to a single scalar signal, potentially discarding nuance, and only three dimensions are considered.
3. The GRM is trained with GPT-4o; behaviors could overfit to that judge, making the system hackable against its own
The problem of dealing with reward hacking in personalization is relevant and the idea presented is practically useful and scalable.
1. While the idea of using generative reward models instead of regular reward models to deal with reward hacking seems interesting and of practical relevance, the novelty is a bit weak. I recommend improving the paper with more significant contributions: one idea is to provide a theoretical guarantee; another is to stress-test the method and understand when it works well and when it doesn't.
2. The benchmarks with the naive DPO and RLHF are too simplistic. I recommend adding more benchmarks, f
- The integration of a GRM with textual critiques is persuasive, providing nuanced, multi-faceted feedback that mitigates common RL pitfalls like verbosity or superficial personalization. This builds effectively on the prior concept of generative verifiers but tailors it to personalization, showing clear empirical benefits in length-controlled evaluations.
- The use of length-debiased metrics (Dubois et al., 2024) and multiple benchmarks ensures fair comparisons, avoiding common biases in LLM-as
- Overall, the missing details in many parts of the paper make it hard to follow the context. For example, Section 3, which builds the core motivation for the proposed approach, lacks any details about the specific experimental setup; it abruptly starts from "... we train with 18k samples" without any provenance or any statement of what the goal of this training is in the first place. In Section 4.1, although the section is meant to give the details of the GRM, it only mentions the construction of training data but does not provide any
- The authors present comprehensive experiments across multiple datasets and scales, showing consistent gains and strong performance even compared to GPT-4.1.
- The length-controlled evaluation and human validation strengthen the credibility of the reported results.
- Ablation and sampling-strategy studies are detailed and help disentangle contributions between the GRM and data collection.
- The reward aggregation uses fixed weights, but no sensitivity or robustness analysis is provided. Since these weights directly control the reward, understanding how performance varies with them is essential.
- The paper insufficiently situates itself within existing work:
  - From the methodological side, training models on post-critique edits has been extensively explored before (see [1] and many follow-ups), yet this lineage is not discussed.
  - From the application side, no comparison i
Taxonomy
Topics: Recommender Systems and Techniques · Topic Modeling · Machine Learning in Healthcare
