Online Preference Alignment for Language Models via Count-based Exploration

Chenjia Bai; Yang Zhang; Shuang Qiu; Qiaosheng Zhang; Kang Xu; Xuelong Li

arXiv:2501.12735·cs.LG·February 10, 2025

TL;DR

This paper introduces COPO, a count-based online RLHF method that improves language model alignment by encouraging exploration and expanding data coverage through a simple counting mechanism.

Contribution

It proposes a novel count-based exploration bonus for online RLHF, providing theoretical motivation and a practical algorithm to enhance LLM preference alignment.
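
As a rough illustration of what a count-based exploration bonus can look like, the sketch below keeps a table of visit counts for prompt-response pairs and returns a bonus that shrinks as a pair is seen more often. The class name `CountBonus`, the `scale` parameter, and the `1 / sqrt(n + 1)` form are assumptions chosen for illustration. This page does not specify COPO's exact counting mechanism, and exact-match counting over strings would not scale to real LLM outputs.

```python
import math
from collections import Counter


class CountBonus:
    """Hypothetical tabular count-based bonus (illustrative, not the paper's code)."""

    def __init__(self, scale=1.0):
        self.scale = scale        # assumed bonus coefficient
        self.counts = Counter()   # visit counts per (prompt, response) pair

    def update(self, prompt, response):
        # Record one more visit of this prompt-response pair.
        self.counts[(prompt, response)] += 1

    def bonus(self, prompt, response):
        # Rarely seen pairs get a larger bonus; the +1 avoids division by zero.
        n = self.counts[(prompt, response)]
        return self.scale / math.sqrt(n + 1)


tracker = CountBonus(scale=1.0)
novel = tracker.bonus("Explain UCB.", "UCB adds an optimism term.")
for _ in range(3):
    tracker.update("Explain UCB.", "UCB adds an optimism term.")
seen = tracker.bonus("Explain UCB.", "UCB adds an optimism term.")
print(novel, seen)
```

In an online loop, a bonus of this kind would be added to the training signal so that responses outside the already-collected data are favored during collection.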

Findings

01 COPO significantly improves instruction-following performance.

02 The method increases data coverage and exploration in online RLHF.

03 Experimental results on Zephyr and Llama-3 show superior performance.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has shown great potential in fine-tuning Large Language Models (LLMs) to align with human preferences. Existing methods perform preference alignment from a fixed dataset, which can be limited in data coverage, and the resulting reward model generalizes poorly to out-of-distribution responses. Online RLHF is therefore more desirable, since it lets the LLM explore outside the support of the initial dataset by iteratively collecting prompt-response pairs. In this paper, we study the fundamental problem in online RLHF, i.e., how to explore for LLMs. We give a theoretical motivation under a linear reward assumption to show that an optimistic reward with an upper confidence bound (UCB) term leads to a provably efficient RLHF policy. Then, we reformulate our objective to direct preference optimization with an exploration term, where the…
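
To make the "optimistic reward with a UCB term" idea more concrete, here is a minimal sketch of the standard elliptical bonus used with linear models: `beta * sqrt(phi^T Sigma^{-1} phi)`, where `Sigma = lam * I + sum of phi phi^T` over observed feature vectors. The class name `LinearUCBBonus`, the parameters `lam` and `beta`, and the feature vectors are assumptions for illustration; the abstract above is truncated and does not give the paper's exact formulation.

```python
import math


class LinearUCBBonus:
    """Hypothetical elliptical UCB bonus under a linear reward model (illustrative)."""

    def __init__(self, dim, lam=1.0, beta=1.0):
        self.dim = dim
        self.beta = beta  # assumed exploration coefficient
        # Start from the inverse of lam * I (regularized covariance).
        self.sigma_inv = [
            [(1.0 / lam if i == j else 0.0) for j in range(dim)]
            for i in range(dim)
        ]

    def _matvec(self, v):
        # Compute sigma_inv @ v with plain Python lists.
        return [sum(row[j] * v[j] for j in range(self.dim)) for row in self.sigma_inv]

    def bonus(self, phi):
        # beta * sqrt(phi^T Sigma^{-1} phi): large in directions seen rarely.
        u = self._matvec(phi)
        quad = sum(p * q for p, q in zip(phi, u))
        return self.beta * math.sqrt(max(quad, 0.0))

    def update(self, phi):
        # Sherman-Morrison rank-one update of Sigma^{-1} after observing phi.
        u = self._matvec(phi)
        denom = 1.0 + sum(p * q for p, q in zip(phi, u))
        for i in range(self.dim):
            for j in range(self.dim):
                self.sigma_inv[i][j] -= u[i] * u[j] / denom


ucb = LinearUCBBonus(dim=2, lam=1.0, beta=1.0)
before = ucb.bonus([1.0, 0.0])
ucb.update([1.0, 0.0])
after = ucb.bonus([1.0, 0.0])       # shrinks for the observed direction
unseen = ucb.bonus([0.0, 1.0])      # unchanged for the unobserved direction
print(before, after, unseen)
```

With one-hot features, this bonus reduces to `beta / sqrt(n + lam)` for an item seen `n` times, which is one common way to relate UCB-style optimism to simple visit counts.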

Code & Models

Repositories

baichenjia/copo (PyTorch, official)

Taxonomy

Topics: Data Management and Algorithms · Natural Language Processing Techniques · Recommender Systems and Techniques

Methods: ALIGN