Improving Reward Models with Synthetic Critiques
Zihuiwen Ye, Fraser Greenlee-Scott, Max Bartolo, Phil Blunsom, Jon Ander Campos, Matthias Gallé

TL;DR
This paper introduces a method for improving reward models with synthetic critiques generated by large language models, leading to better performance, data efficiency, and robustness.
Contribution
The paper proposes leveraging synthetic natural language critiques to improve reward models, reducing dependence on human annotations and enhancing generalization.
Findings
Synthetic critiques improve RM performance.
Reduced need for human-labeled data.
Enhanced robustness and interpretability.
Abstract
Reward models (RMs) play a critical role in aligning language models through the process of reinforcement learning from human feedback. RMs are trained to predict a score reflecting human preference, which requires significant time and cost for human annotation. Additionally, RMs tend to quickly overfit on superficial features in the training set, hindering their generalization performance on unseen distributions. We propose a novel approach using synthetic natural language critiques generated by large language models to provide additional feedback, evaluating aspects such as instruction following, correctness, and style. This offers richer signals and more robust features for RMs to assess and score on. We demonstrate that high-quality critiques improve the performance and data efficiency of RMs initialized from different pretrained models, reducing the reliance on costly human…
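As a rough illustration of the idea in the abstract, the sketch below shows one way a critique could be attached to each response before scoring, together with a pairwise preference loss. The input template, the `PreferencePair` type, and the function names are hypothetical, and the Bradley-Terry style loss is a common choice for preference-trained RMs that is assumed here, not taken from the abstract.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One human-labeled comparison (hypothetical container)."""
    prompt: str
    chosen: str
    rejected: str


def build_rm_input(prompt: str, response: str, critique: str) -> str:
    """Join prompt, response, and critique into one RM input string.

    The template is an assumption made for illustration only.
    """
    return f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}"


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected)).

    Written in a numerically stable form so large margins do not overflow.
    """
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Usage: build critique-augmented inputs for both sides of a pair.
pair = PreferencePair(prompt="Add 2 and 2.", chosen="4", rejected="5")
chosen_input = build_rm_input(pair.prompt, pair.chosen, "The answer is correct.")
rejected_input = build_rm_input(pair.prompt, pair.rejected, "The sum is wrong.")
```

In this sketch the reward model itself is left out: it would map each augmented input to a scalar score, and `pairwise_loss` would then push the chosen score above the rejected one.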
Taxonomy
Topics: Diverse Scientific and Economic Studies
