Language Model Alignment with Elastic Reset
Michael Noukhovitch, Samuel Lavoie, Florian Strub, Aaron Courville

TL;DR
This paper introduces Elastic Reset, an algorithm for fine-tuning language models with reinforcement learning that reaches higher reward with less drift, without explicitly modifying the training objective. It is evaluated on pivot-translation, sentiment, and QA chatbot tasks.
Contribution
Elastic Reset periodically resets the online model to an exponential moving average (EMA) of itself, then resets the EMA model to the initial model. This yields higher reward and less drift without changing the training objective.
Findings
Achieves higher reward with less drift compared to standard methods.
Outperforms baselines on pivot-translation and sentiment tasks.
Produces a more aligned and effective QA chatbot.
Abstract
Finetuning language models with reinforcement learning (RL), e.g. from human feedback (HF), is a prominent method for alignment. But optimizing against a reward model can improve on reward while degrading performance in other areas, a phenomenon known as reward hacking, alignment tax, or language drift. First, we argue that commonly-used test metrics are insufficient and instead measure how different algorithms trade off between reward and drift. The standard method modifies the reward with a Kullback-Leibler (KL) penalty between the online and initial model. We propose Elastic Reset, a new algorithm that achieves higher reward with less drift without explicitly modifying the training objective. We periodically reset the online model to an exponential moving average (EMA) of itself, then reset the EMA model to the initial model. Through the use of an EMA, our model recovers quickly…
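The reset schedule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: parameters are plain floats in a dict rather than network weights, `step_fn` is a hypothetical stand-in for one RL update, and the `decay` and `reset_every` values are placeholders, since the summary above does not give the paper's hyperparameters or say exactly when the EMA is updated.

```python
def ema_update(ema, online, decay):
    """Move each EMA parameter toward the online parameter (in place)."""
    for name in ema:
        ema[name] = decay * ema[name] + (1.0 - decay) * online[name]


def elastic_reset(online, ema, initial):
    """Reset the online model to the EMA, then the EMA to the initial model."""
    for name in online:
        online[name] = ema[name]
    for name in ema:
        ema[name] = initial[name]


def train(initial, step_fn, num_steps, reset_every, decay=0.5):
    """Run `num_steps` updates, applying an elastic reset every `reset_every` steps.

    `step_fn(online)` is a hypothetical placeholder for one RL update that
    mutates the online parameters in place. Updating the EMA after every
    step is an assumption made for this sketch.
    """
    online = dict(initial)  # online model starts from the initial model
    ema = dict(initial)     # EMA model also starts from the initial model
    for step in range(1, num_steps + 1):
        step_fn(online)
        ema_update(ema, online, decay)
        if step % reset_every == 0:
            elastic_reset(online, ema, initial)
    return online, ema


def toy_step(params):
    """Toy update: push the single parameter up by 1.0 (stands in for reward ascent)."""
    params["w"] += 1.0


if __name__ == "__main__":
    online, ema = train({"w": 0.0}, toy_step, num_steps=2, reset_every=2)
    print(online, ema)
```

In this toy run the online parameter is pulled back from 2.0 toward the initial value by the reset, which illustrates how the schedule can limit drift without adding a penalty term to the objective.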
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
