RL with KL penalties is better viewed as Bayesian inference
Tomasz Korbak, Ethan Perez, Christopher L Buckley

TL;DR
This paper reinterprets KL-regularized reinforcement learning for language model fine-tuning as Bayesian inference. The reinterpretation explains why the KL penalty prevents the distribution collapse that plain reward maximization causes, and it supports the argument that RL is not an adequate formal framework for fine-tuning LMs.
Contribution
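It demonstrates that KL-regularized RL is equivalent to variational inference: the fine-tuned LM approximates a Bayesian posterior that specifies how to update a prior LM to conform with the evidence provided by the reward function. This gives the KL penalty a principled justification and explains why it prevents distribution collapse.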
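The equivalence can be sketched in a few lines. The notation here is an assumption made for illustration and is not taken from this page: $\pi_\theta$ is the fine-tuned LM, $\pi_0$ the original (prior) LM, $r$ the reward function, and $\beta > 0$ the KL coefficient.

```latex
\begin{align*}
J(\theta) &= \mathbb{E}_{x \sim \pi_\theta}\,[r(x)] - \beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_0), \\
\pi^*(x)  &= \frac{1}{Z}\,\pi_0(x)\,\exp\!\big(r(x)/\beta\big),
\qquad Z = \sum_x \pi_0(x)\,\exp\!\big(r(x)/\beta\big), \\
\mathrm{KL}(\pi_\theta \,\|\, \pi^*)
          &= \mathrm{KL}(\pi_\theta \,\|\, \pi_0) - \tfrac{1}{\beta}\,\mathbb{E}_{x \sim \pi_\theta}[r(x)] + \log Z, \\
J(\theta) &= -\beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi^*) + \beta \log Z .
\end{align*}
```

Since $\beta \log Z$ does not depend on $\theta$, maximizing $J$ is the same as minimizing the KL divergence from $\pi_\theta$ to the posterior $\pi^*$, which is the variational-inference reading of the objective.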
Findings
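KL-regularized RL avoids distribution collapse because its optimum is a Bayesian posterior over sequences (the prior LM reweighted by the reward), not a degenerate distribution concentrated on the highest-reward sequence.
Reinterpreting RL as Bayesian inference makes explicit what the fine-tuning objective is optimizing toward: a target distribution rather than a reward maximum.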
Other fine-tuning objectives beyond RL can be derived from Bayesian principles.
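The findings above can be checked numerically on a toy problem. The following is a minimal sketch, not the authors' code: the four-sequence "vocabulary", the prior `pi0`, the rewards `r`, and the coefficient `beta` are all made-up values, and `posterior` and `objective` are hypothetical helper names.

```python
import math

def posterior(pi0, r, beta):
    """Closed-form target: pi*(x) proportional to pi0(x) * exp(r(x) / beta).

    Returns the normalized distribution and log Z. The max reward is
    subtracted inside the exponent for numerical stability.
    """
    r_max = max(r)
    weights = [p * math.exp((ri - r_max) / beta) for p, ri in zip(pi0, r)]
    total = sum(weights)
    log_z = r_max / beta + math.log(total)
    return [w / total for w in weights], log_z

def objective(pi, pi0, r, beta):
    """KL-regularized RL objective: E_pi[r] - beta * KL(pi || pi0)."""
    expected_reward = sum(p * ri for p, ri in zip(pi, r))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi0) if p > 0)
    return expected_reward - beta * kl

# Toy setup (assumed values): four possible sequences.
pi0 = [0.4, 0.3, 0.2, 0.1]   # prior LM probabilities
r = [0.0, 1.0, 0.5, 1.2]     # reward of each sequence
beta = 0.5                   # KL coefficient

pi_star, log_z = posterior(pi0, r, beta)
print("posterior:", [round(p, 3) for p in pi_star])
print("J(pi*):", objective(pi_star, pi0, r, beta), "beta*logZ:", beta * log_z)

# As beta shrinks, the KL term vanishes and the optimum collapses
# onto the single highest-reward sequence.
collapsed, _ = posterior(pi0, r, 0.01)
print("beta=0.01:", [round(p, 3) for p in collapsed])
```

With a moderate `beta` the optimum keeps probability mass on every sequence the prior supports; driving `beta` toward zero recovers plain reward maximization and the degenerate distribution the abstract calls distribution collapse.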
Abstract
Reinforcement learning (RL) is frequently employed in fine-tuning large language models (LMs), such as GPT-3, to penalize them for undesirable features of generated sequences, such as offensiveness, social bias, harmfulness or falsehood. The RL formulation involves treating the LM as a policy and updating it to maximise the expected value of a reward function which captures human preferences, such as non-offensiveness. In this paper, we analyze challenges associated with treating a language model as an RL policy and show how avoiding those challenges requires moving beyond the RL paradigm. We start by observing that the standard RL approach is flawed as an objective for fine-tuning LMs because it leads to distribution collapse: turning the LM into a degenerate distribution. Then, we analyze KL-regularised RL, a widely used recipe for fine-tuning LMs, which additionally constrains the…
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Byte Pair Encoding · Dense Connections · Dropout · Cosine Annealing
