Bootstrapping Language Models with DPO Implicit Rewards
Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin

TL;DR
This paper introduces DICE, a method that uses the implicit rewards from a language model's own DPO training to iteratively improve alignment, gaining over 8% in length-controlled win rate on AlpacaEval 2 without external feedback.
Contribution
The paper presents DICE, a bootstrapping approach that uses the implicit reward model obtained from DPO training to build new preference data for subsequent DPO rounds, so that a language model can continue aligning itself.
Findings
Achieved over 8% improvement in length-controlled win rate on AlpacaEval 2.
Demonstrated effectiveness across multiple base models.
Improved alignment without external human feedback.
Abstract
Human alignment in large language models (LLMs) is an active area of research. A recent groundbreaking work, direct preference optimization (DPO), has greatly simplified the process from past work in reinforcement learning from human feedback (RLHF) by bypassing the reward learning stage in RLHF. DPO, after training, provides an implicit reward model. In this work, we make a novel observation that this implicit reward model can by itself be used in a bootstrapping fashion to further align the LLM. Our approach is to use the rewards from a current LLM to construct a preference dataset, which is then used in subsequent DPO rounds. We incorporate two refinements to further improve our approach: 1) length-regularized reward shaping to make the preference dataset length-unbiased; 2) experience replay to enhance the quality of the preference dataset. Our approach, named self-alignment with…
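The loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It relies on the standard DPO result that the implicit reward is r(x, y) = β (log π(y|x) − log π_ref(y|x)), up to a term that depends only on the prompt. The linear length penalty `alpha * num_tokens`, the best-versus-worst pairing rule, and the mixing rule controlled by `gamma` are assumptions made for illustration, as are all names here (`Candidate`, `build_pair`, `mix_with_replay`).

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    """One sampled response with precomputed sequence log-probabilities (hypothetical container)."""
    text: str
    logp_policy: float  # sum of token log-probs under the current DPO-trained policy
    logp_ref: float     # sum of token log-probs under the frozen reference model
    num_tokens: int


def implicit_reward(c: Candidate, beta: float) -> float:
    # DPO implicit reward, up to a prompt-only term that cancels when
    # comparing responses to the same prompt.
    return beta * (c.logp_policy - c.logp_ref)


def shaped_reward(c: Candidate, beta: float, alpha: float) -> float:
    # ASSUMPTION: a linear per-token penalty stands in for the paper's
    # length-regularized reward shaping.
    return implicit_reward(c, beta) - alpha * c.num_tokens


def build_pair(prompt: str, candidates: list, beta: float = 0.1, alpha: float = 0.0) -> dict:
    # ASSUMPTION: the highest-scoring response becomes "chosen" and the
    # lowest-scoring one becomes "rejected".
    ranked = sorted(candidates, key=lambda c: shaped_reward(c, beta, alpha))
    return {"prompt": prompt, "chosen": ranked[-1].text, "rejected": ranked[0].text}


def mix_with_replay(generated: list, offline: list, gamma: float, seed: int = 0) -> list:
    # ASSUMPTION: keep a gamma fraction of self-generated pairs and fill the
    # remaining (1 - gamma) fraction with pairs replayed from the offline set.
    rng = random.Random(seed)
    n_gen = round(gamma * len(generated))
    n_off = min(len(generated) - n_gen, len(offline))
    return rng.sample(generated, n_gen) + rng.sample(offline, n_off)


if __name__ == "__main__":
    short = Candidate("short answer", logp_policy=-10.0, logp_ref=-12.0, num_tokens=5)
    long_ = Candidate("long answer", logp_policy=-30.0, logp_ref=-35.0, num_tokens=40)
    # Without a length penalty the longer response wins; with one, the shorter does.
    print(build_pair("example prompt", [short, long_], alpha=0.0)["chosen"])
    print(build_pair("example prompt", [short, long_], alpha=0.01)["chosen"])
```

In this toy example the longer response has the higher raw implicit reward (0.5 versus 0.2), so it is chosen when `alpha = 0`. A penalty of 0.01 per token flips the ranking, which is the kind of length bias the shaping step is meant to counter. In practice the log-probabilities would come from scoring sampled responses with the policy and reference models.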
Peer Reviews
Decision: ICLR 2025 Poster
The writing is clear and easy to follow, and the experiments provide support for the claim that repeated use of the implicit reward model could enhance performance.
The results offer a useful insight, though the approach itself may lack significant novelty, and the improvements seem to be marginal. Additionally, length-regularized reward shaping is already a widely adopted technique. While the paper seeks to improve model performance through Direct Preference Optimization (DPO) by leveraging the implicit reward model, it would benefit from a theoretical explanation of why repeated use of the implicit reward model could enhance performance. Moreover, it migh
The method is natural and straightforward to apply: using the current policy to generate both new on-policy samples and rewards provides an easy way to do iterative policy improvement with no additional models or data. I expect the method to be of great interest to practitioners who value this benefit, which is similar to the benefit of DPO compared to PPO for RLHF. The experiments are thorough and promising enough that I expect others will want to try out the method in their own settings, alt
A glaring methodological weakness is in the treatment of the hyperparameter gamma (1 minus the fraction of offline data used): if I am understanding correctly, the tuning was done using the final reported results. This is contrary to standard machine learning best practices, whereby hyperparameters are tuned using a validation set. This gives the method a significant advantage if the difference between different values of gamma has a significant noise component. The same applies to the hyperpara
Iterative finetuning on self-supervised data is an important direction, which has received a reasonable amount of attention, e.g., with Anthropic's Constitutional AI / RLAIF, and others. The DPO approach has also gained significant adoption, so improving DPO with iterative finetuning would be of interest to the community. The proposed procedure is straightforward and the results are good. The paper as a whole is well written and presented. I believe I could reimplement the method/experiments b
Main concern: I remain unconvinced this is the best approach to iteratively finetuning a DPO model, since the evaluation only tests downstream performance and ignores other reward models that are known to be empirically stronger than implicit DPO rewards. If you look at RewardBench, the implicit rewards of DPO models are quite far down the list in terms of their strength as a reward model. Furthermore, reward models themselves are not very difficult to train, so the overhead as co
Code & Models
- 🤗 sail/Zephyr-7B-DICE-Iter1 (model) · 2 dl
- 🤗 sail/Llama-3-Base-8B-DICE-Iter1 (model) · 6 dl · ♡ 2
- 🤗 sail/Llama-3-Base-8B-DICE-Iter2 (model) · 7 dl · ♡ 3
- 🤗 sail/Zephyr-7B-DICE-Iter2 (model) · 4 dl · ♡ 2
- 🤗 RichardErkhov/sail_-_Llama-3-Base-8B-DICE-Iter2-8bits (model) · 1 dl
- 🤗 RichardErkhov/sail_-_Llama-3-Base-8B-DICE-Iter2-awq (model) · 1 dl
- 🤗 RichardErkhov/sail_-_Llama-3-Base-8B-DICE-Iter1-8bits (model)
- 🤗 RichardErkhov/sail_-_Llama-3-Base-8B-DICE-Iter1-awq (model)
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Direct Preference Optimization · Residual Connection · Softmax · ALIGN · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Linear Layer
