Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

Shenao Zhang; Donghan Yu; Hiteshi Sharma; Han Zhong; Zhihan Liu; Ziyi Yang; Shuohang Wang; Hany Hassan; Zhaoran Wang

arXiv:2405.19332·cs.LG·November 6, 2024

TL;DR

This paper introduces Self-Exploring Language Models (SELM), an active exploration method for online alignment of LLMs. SELM improves reward modeling and instruction-following performance through iterative exploration that is optimistically biased towards potentially high-reward responses.

Contribution

The paper proposes a bilevel optimization objective that makes exploration more efficient and removes the need for a separate reward model in online LLM alignment.

Findings

01. SELM outperforms DPO in exploration and alignment quality.

02. Fine-tuning Zephyr-7B-SFT and Llama-3-8B-Instruct with SELM improves their benchmark scores.

03. SELM yields significant gains on MT-Bench, AlpacaEval 2.0, and academic benchmarks.

Abstract

Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the…

Code & Models

Repositories

shenao-zhang/selm (PyTorch, official)

Taxonomy

Topics: Recommender Systems and Techniques