Understanding the performance gap between online and offline alignment algorithms
Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng, Will Dabney

TL;DR
This paper investigates why online alignment algorithms outperform offline ones in reinforcement learning from human feedback, revealing the importance of on-policy sampling and the limitations of offline methods.
Contribution
The study provides empirical evidence on the performance gap between online and offline alignment algorithms and explores underlying causes beyond data coverage and quality.
Findings
Online algorithms excel at generation tasks, offline at pairwise classification.
Offline data coverage and quality do not fully explain performance differences.
Sampling process critically impacts the discriminative and generative capabilities of policies.
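To make "pairwise classification" in the findings above concrete, here is a minimal sketch of how a policy could be scored as a classifier over preference pairs. It assumes a DPO-style implicit reward, beta * (log pi(y|x) - log pi_ref(y|x)), which is an assumption for illustration and not necessarily the exact protocol used in the paper. The function names, the dictionary keys, the beta value, and the log-probabilities in toy_pairs are all hypothetical.

```python
def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO-style implicit reward (assumed form): beta * (log pi - log pi_ref)."""
    return beta * (logp_policy - logp_ref)


def pairwise_accuracy(pairs, beta=0.1):
    """Fraction of pairs where the chosen response gets the higher implicit reward.

    Each pair holds sequence log-probabilities of the chosen and rejected
    responses under the trained policy and under the reference policy.
    """
    correct = 0
    for p in pairs:
        r_chosen = implicit_reward(p["policy_chosen"], p["ref_chosen"], beta)
        r_rejected = implicit_reward(p["policy_rejected"], p["ref_rejected"], beta)
        if r_chosen > r_rejected:
            correct += 1
    return correct / len(pairs)


# Made-up log-probabilities, for illustration only (not results from the paper).
toy_pairs = [
    {"policy_chosen": -10.0, "ref_chosen": -12.0,
     "policy_rejected": -15.0, "ref_rejected": -14.0},
    {"policy_chosen": -20.0, "ref_chosen": -18.0,
     "policy_rejected": -9.0, "ref_rejected": -11.0},
    {"policy_chosen": -5.0, "ref_chosen": -6.0,
     "policy_rejected": -7.0, "ref_rejected": -7.0},
]

accuracy = pairwise_accuracy(toy_pairs)
print(f"pairwise classification accuracy: {accuracy:.3f}")
```

A metric like this only checks whether the policy ranks a fixed pair correctly. It does not measure the quality of responses the policy samples on its own, which is the distinction the findings draw between classification and generation.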
Abstract
Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, the rising popularity of offline alignment algorithms challenges the need for on-policy sampling in RLHF. Within the context of reward over-optimization, we start with an opening set of experiments that demonstrate the clear advantage of online methods over offline methods. This prompts us to investigate the causes of the performance discrepancy through a series of carefully designed experimental ablations. We show empirically that hypotheses such as offline data coverage and data quality cannot, by themselves, convincingly explain the performance difference. We also find that while offline algorithms train the policy to become good at pairwise classification, it is worse at generation; meanwhile, the policies trained by online algorithms are good at generation while worse…
Taxonomy
Topics: Genomics and Phylogenetic Studies · Gene expression and cancer classification · Algorithms and Data Compression
Methods: Sparse Evolutionary Training
