Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning

Mathieu Rita; Florian Strub; Rahma Chaabouni; Paul Michel; Emmanuel Dupoux; Olivier Pietquin

arXiv:2404.19409·cs.CL·May 1, 2024

TL;DR

This paper introduces Reward Calibration from Demonstration (RCfD), a novel method that uses human demonstrations and reward models to recalibrate reward functions, effectively mitigating reward over-optimization in large language models without extensive hyperparameter tuning.

Contribution

The paper proposes RCfD, a demonstration-guided reinforcement learning approach that recalibrates the reward objective to prevent over-optimization, yielding more natural language generation and reducing reliance on KL regularization.

Findings

01. RCfD effectively mitigates reward over-optimization in LLMs.

02. RCfD achieves comparable performance to tuned baselines.

03. RCfD promotes diverse and natural language generation.

Abstract

While Reinforcement Learning (RL) has been proven essential for tuning large language models (LLMs), it can lead to reward over-optimization (ROO). Existing approaches address ROO by adding KL regularization, requiring computationally expensive hyperparameter tuning. Additionally, KL regularization focuses solely on regularizing the language policy, neglecting a potential source of regularization: the reward function itself. Inspired by demonstration-guided RL, we here introduce the Reward Calibration from Demonstration (RCfD), which leverages human demonstrations and a reward model to recalibrate the reward objective. Formally, given a prompt, the RCfD objective minimizes the distance between the demonstrations' and LLM's rewards rather than directly maximizing the reward function. This objective shift avoids incentivizing the LLM to exploit the reward model and promotes more natural…
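The objective shift described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy reward model, and the choice of a squared difference as the "distance" are all assumptions made for illustration (the abstract only says the objective minimizes a distance between the demonstration's reward and the LLM's reward).

```python
from typing import Callable

# A reward model maps (prompt, completion) to a scalar score.
RewardModel = Callable[[str, str], float]


def standard_reward(reward_model: RewardModel, prompt: str, completion: str) -> float:
    """Usual RL objective: the raw reward-model score, to be maximized."""
    return reward_model(prompt, completion)


def rcfd_reward(
    reward_model: RewardModel, prompt: str, completion: str, demonstration: str
) -> float:
    """Calibrated objective (sketch): negative distance to the demonstration's reward.

    The squared difference is an assumed choice of distance.
    """
    gap = reward_model(prompt, completion) - reward_model(prompt, demonstration)
    return -(gap ** 2)


def toy_reward_model(prompt: str, completion: str) -> float:
    """Hypothetical, exploitable proxy: longer completions score higher."""
    return float(len(completion.split()))


prompt = "Summarize: the cat sat on the mat."
demonstration = "A cat sat on a mat."   # human-written reference
natural = "The cat sat on a mat."       # plausible model output
exploit = "cat " * 50                   # degenerate output that games the proxy

# Raw reward prefers the degenerate output.
print(standard_reward(toy_reward_model, prompt, natural))
print(standard_reward(toy_reward_model, prompt, exploit))

# Calibrated reward prefers the output whose score matches the demonstration's.
print(rcfd_reward(toy_reward_model, prompt, natural, demonstration))
print(rcfd_reward(toy_reward_model, prompt, exploit, demonstration))
```

Under the raw objective, the repeated-token output wins because the toy proxy rewards length. Under the calibrated objective, scoring far above the demonstration is penalized just like scoring far below it, so there is no incentive to push the reward model beyond the range where human text sits.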

Code & Models

Repositories

mathieurita/llm_demonstration_guided_rl (Official)

Taxonomy

Topics: Scheduling and Optimization Algorithms · Reinforcement Learning in Robotics