Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion

Yannis Flet-Berliac; Nathan Grinsztajn; Florian Strub; Bill Wu; Eugene Choi; Chris Cremer; Arash Ahmadian; Yash Chandak; Mohammad Gheshlaghi Azar; Olivier Pietquin; Matthieu Geist

arXiv:2406.19185·cs.LG·January 17, 2025

Open Access

TL;DR

This paper introduces Contrastive Policy Gradient (CoPG), a new reinforcement learning algorithm for fine-tuning large language models that can optimize arbitrary sequence-level rewards using off-policy data, improving alignment with desired outcomes.

Contribution

The paper presents CoPG, a mathematically principled off-policy policy gradient method that generalizes existing approaches and does not rely on importance sampling, enabling more efficient LLM fine-tuning.

Findings

01 CoPG effectively estimates optimal policies from off-policy data.

02 It outperforms traditional policy gradient methods in experiments.

03 It successfully fine-tunes an LLM on a summarization task with a learned reward.

Abstract

Reinforcement Learning (RL) has been used to finetune Large Language Models (LLMs) using a reward model trained from preference data, to better align with human judgment. The recently introduced direct alignment methods, which are often simpler, more stable, and computationally lighter, can more directly achieve this. However, these approaches cannot optimize arbitrary rewards, and the preference-based ones are not the only rewards of interest for LLMs (e.g., unit tests for code generation or textual entailment for summarization, among others). RL-finetuning is usually done with a variation of policy gradient, which calls for on-policy or near-on-policy samples, requiring costly generations. We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data. It can be seen as an…

Taxonomy

Topics: Educational Assessment and Improvement

Methods: ALIGN