Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data

Fahim Tajwar; Anikait Singh; Archit Sharma; Rafael Rafailov; Jeff Schneider; Tengyang Xie; Stefano Ermon; Chelsea Finn; Aviral Kumar

arXiv:2404.14367 · cs.LG · June 4, 2024 · 3 cites

TL;DR

This paper analyzes preference fine-tuning methods for large language models and finds that on-policy sampling and negative-gradient approaches outperform offline and maximum-likelihood methods, a result that informs how preference data should be collected.

Contribution

It provides a rigorous analysis showing that mode-seeking objectives with on-policy sampling are more effective for preference fine-tuning of LLMs than traditional offline methods.

Findings

01. On-policy sampling outperforms offline methods.

02. Negative-gradient approaches are more effective than maximum-likelihood methods.

03. Mode-seeking objectives enable faster probability mass shifts.
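As a rough illustration of the second finding, the sketch below contrasts a maximum-likelihood loss with a DPO-style contrastive loss on a single preference pair. The summary above does not say which contrastive objective the paper uses, so the DPO-style form, the function names, and the `beta` value are all assumptions made for illustration. The point it shows is only that the contrastive loss has a positive gradient with respect to the rejected response's log-probability, so gradient descent pushes that log-probability down, whereas the maximum-likelihood loss never touches the rejected response.

```python
import math

def mle_loss(logp_chosen):
    """Maximum likelihood on the preferred response only (no rejected term)."""
    return -logp_chosen

def contrastive_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss (assumed form): -log sigmoid(beta * margin)."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))

def contrastive_grads(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Gradients of the contrastive loss w.r.t. the two policy log-probs."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    weight = 1.0 / (1.0 + math.exp(beta * margin))  # sigmoid(-beta * margin)
    grad_chosen = -beta * weight    # descent raises the chosen log-prob
    grad_rejected = beta * weight   # descent lowers the rejected log-prob
    return grad_chosen, grad_rejected

# Hypothetical log-probabilities for one preference pair.
g_chosen, g_rejected = contrastive_grads(-1.0, -1.0, -1.0, -1.0)
print(g_chosen, g_rejected)
```

With the policy equal to the reference, the margin is zero and the two gradients are equal in size and opposite in sign; the term on the rejected response is the "negative gradient" that a purely supervised objective lacks.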

Abstract

Learning from preference labels plays a crucial role in fine-tuning large language models. There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning. Different methods come with different implementation tradeoffs and performance differences, and existing empirical findings present different conclusions: for instance, some results show that online RL is quite important to attain good fine-tuning results, while others find (offline) contrastive or even purely supervised methods sufficient. This raises a natural question: what kind of approaches are important for fine-tuning with preference data and why? In this paper, we answer this question by performing a rigorous analysis of a number of fine-tuning techniques on didactic and full-scale LLM problems. Our main finding is that, in…

Code & Models

Repositories

Asap7772/understanding-rlhf
PyTorch · Official

Taxonomy

Topics: Efficiency Analysis Using DEA · Healthcare Policy and Management · Auction Theory and Applications