Understanding the performance gap between online and offline alignment algorithms

Yunhao Tang; Daniel Zhaohan Guo; Zeyu Zheng; Daniele Calandriello; Yuan Cao; Eugene Tarassov; Rémi Munos; Bernardo Ávila Pires; Michal Valko; Yong Cheng; Will Dabney

arXiv:2405.08448·cs.LG·May 15, 2024·3 cites

TL;DR

This paper investigates why online alignment algorithms outperform offline ones in reinforcement learning from human feedback, revealing the importance of on-policy sampling and the limitations of offline methods.

Contribution

The study provides empirical evidence on the performance gap between online and offline alignment algorithms and explores underlying causes beyond data coverage and quality.

Findings

01. Online algorithms excel at generation tasks, while offline algorithms excel at pairwise classification.

02. Offline data coverage and quality do not fully explain the performance differences.

03. The sampling process critically affects the discriminative and generative capabilities of policies.

Abstract

Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, the rising popularity of offline alignment algorithms challenges the need for on-policy sampling in RLHF. Within the context of reward over-optimization, we start with an opening set of experiments that demonstrate the clear advantage of online methods over offline methods. This prompts us to investigate the causes of the performance discrepancy through a series of carefully designed experimental ablations. We show empirically that hypotheses such as offline data coverage and data quality cannot, by themselves, convincingly explain the performance difference. We also find that while offline algorithms train the policy to become good at pairwise classification, it is worse at generation; meanwhile, the policies trained by online algorithms are good at generation while worse…
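To make the phrase "pairwise classification" concrete, here is a minimal sketch of how a policy could be scored as a preference classifier. It assumes a DPO-style implicit reward, beta * (log pi(y|x) - log pi_ref(y|x)), which is an illustrative assumption and not something stated on this page. The function names, the `beta` value, and the log-probabilities are all hypothetical.

```python
from typing import Sequence, Tuple

# One preference pair, as summed sequence log-probabilities:
# (policy logp of chosen, reference logp of chosen,
#  policy logp of rejected, reference logp of rejected)
Pair = Tuple[float, float, float, float]


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Assumed DPO-style implicit reward: beta * log(pi / pi_ref)."""
    return beta * (logp_policy - logp_ref)


def pairwise_accuracy(pairs: Sequence[Pair], beta: float = 0.1) -> float:
    """Fraction of pairs where the chosen response gets the higher implicit reward."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = 0
    for logp_c, ref_c, logp_r, ref_r in pairs:
        margin = implicit_reward(logp_c, ref_c, beta) - implicit_reward(logp_r, ref_r, beta)
        if margin > 0:
            correct += 1
    return correct / len(pairs)


# Made-up log-probabilities, for illustration only.
example_pairs = [
    (-10.0, -12.0, -15.0, -14.0),  # chosen favored by the policy
    (-20.0, -18.0, -19.0, -19.0),  # rejected favored by the policy
    (-5.0, -6.0, -7.0, -6.5),      # chosen favored by the policy
]
accuracy = pairwise_accuracy(example_pairs)
```

A metric like this only asks whether the policy ranks two fixed responses correctly. It says nothing about the quality of responses the policy would sample itself, which is the distinction the abstract draws between classification and generation.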

Code & Models

Datasets

misovalko/my-research-papers
dataset· 21 dl

Taxonomy

Topics: Genomics and Phylogenetic Studies · Gene expression and cancer classification · Algorithms and Data Compression

Methods: Sparse Evolutionary Training