CREAM: Consistency Regularized Self-Rewarding Language Models
Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, Huaxiu Yao

TL;DR
CREAM introduces a regularization technique for self-rewarding language models that leverages reward consistency across iterations, significantly enhancing alignment quality and reliability without human preference data.
Contribution
It proposes a novel regularization method based on reward consistency to improve self-rewarding LLMs, addressing bias and reliability issues in iterative preference fine-tuning.
Findings
CREAM improves reward consistency in self-rewarding LLMs.
CREAM outperforms baseline models in alignment tasks.
Regularization enhances the reliability of preference data.
Abstract
Recent self-rewarding large language models (LLMs) have successfully applied LLM-as-a-Judge to iteratively improve alignment performance without the need for human annotations of preference data. These methods commonly use the same LLM as both the policy model (which generates responses) and the reward model (which scores and ranks those responses). The ranked responses are then used as preference pairs to train the LLM via direct alignment techniques (e.g., DPO). However, it is noteworthy that throughout this process there is no guarantee of accuracy in the rewarding and ranking, which is critical for ensuring accurate rewards and high-quality preference data. Empirical results from relatively small LLMs (e.g., 7B parameters) also indicate that improvements from self-rewarding may diminish after several iterations in certain situations, which we hypothesize is due to…
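The loop described in the abstract (score several responses, rank them, keep a preference pair) and the idea of checking whether two rankings agree can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Kendall's tau as the rank-correlation measure, assumes a simple (tau + 1) / 2 rescaling to get a consistency value in [0, 1], and uses hypothetical function names and made-up scores. How such a value would enter the training objective is not shown here.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same responses."""
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both scorers order the pair (i, j) the same way.
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def consistency_rate(scores_a, scores_b):
    """Assumed rescaling of tau from [-1, 1] to [0, 1]."""
    return (kendall_tau(scores_a, scores_b) + 1) / 2


def preference_pair(responses, scores):
    """Return (chosen, rejected): the highest- and lowest-scored responses."""
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return responses[best], responses[worst]


# Hypothetical scores for four responses to one prompt, as judged by the
# model at the current iteration and at the previous iteration.
responses = ["resp_a", "resp_b", "resp_c", "resp_d"]
current_scores = [0.9, 0.2, 0.5, 0.7]
previous_scores = [0.8, 0.1, 0.6, 0.4]

chosen, rejected = preference_pair(responses, current_scores)
rate = consistency_rate(current_scores, previous_scores)
print(chosen, rejected, round(rate, 3))
```

A low consistency value signals that the two judges disagree on the ordering, so the resulting preference pair is less trustworthy; a value near 1 signals agreement.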
Peer Reviews
Decision: ICLR 2025 Poster
- The proposed method is interesting, and there are theoretical and empirical proofs of its effectiveness.
- The method uses ranking correlation to evaluate the consistency or uncertainty in self-rewarding.
- First, I think the authors should add some baselines, such as the original form of the method, which just uses the KL constraint toward the Bernoulli distribution. It would be good to break down where the improvement comes from and how much the ranking correlation helps the method.
- Another very important experiment, I think, concerns the fact that previous self-rewarding methods cannot maintain their improvement beyond 3 or 4 iterations. How will your method help with this? Will there still be significant impro
1. The methodology section is well described, with mathematical analyses for readers.
2. The formulation of the challenges and the corresponding analysis are well organized and well explained.
3. Various experiments are conducted to show the performance of CREAM with models like Llama 2 and 3, including comparison with reasonable baselines like SRLMs and external reward models.
1. In line #244, the reason for using the reversed preference order is not explained.
2. Likewise, in line #328, the reasons for preparing a reverse DPO dataset and swapping the best response with the worst response are somewhat unclear.
* The idea of considering diversity across an ensemble of RMs to prevent reward hacking is sound and well motivated.
* Consideration of such a signal is supported by the experimental results.
* The biggest weakness of this paper is that it is not contextualized well with prior work on reward hacking and using ensembles to mitigate this. Given that proposing variants of DPO and related algorithms is increasingly crowded, this is especially important. Beyond discussing and comparing with this prior work, it is also important to understand how pessimism-based approaches compare, both theoretically and empirically, with the proposed methods based on agreement metrics. Similarly to this p
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
