Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
Shenao Zhang, Donghan Yu, Hiteshi Sharma, Han Zhong, Zhihan Liu, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang

TL;DR
This paper introduces Self-Exploring Language Models (SELM), an active exploration method for online alignment of LLMs. SELM improves reward modeling and instruction-following performance by iteratively exploring with an objective that is optimistically biased towards potentially high-reward responses.
Contribution
SELM proposes a bilevel objective that makes exploration more efficient and removes the need for a separate reward model in online LLM alignment.
Findings
SELM explores more efficiently than DPO and yields better-aligned models.
Fine-tuning Zephyr-7B-SFT and Llama-3-8B-Instruct with SELM improves their benchmark scores.
The gains are significant on MT-Bench, AlpacaEval 2.0, and academic benchmarks.
Abstract
Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the…
Code & Models
- 🤗 RichardErkhov/ZhangShenao_-_SELM-Llama-3-8B-Instruct-iter-3-gguf (model, 93 downloads)
- 🤗 RichardErkhov/ZhangShenao_-_SELM-Llama-3-8B-Instruct-iter-2-gguf (model, 95 downloads)
- 🤗 RichardErkhov/ZhangShenao_-_SELM-Llama-3-8B-Instruct-iter-1-gguf (model, 158 downloads)
- 🤗 RichardErkhov/ZhangShenao_-_SELM-Phi-3-mini-4k-instruct-iter-1-8bits (model, 5 downloads)
- 🤗 RichardErkhov/ZhangShenao_-_SELM-Phi-3-mini-4k-instruct-iter-2-4bits (model, 1 download)
- 🤗 RichardErkhov/ZhangShenao_-_SELM-Phi-3-mini-4k-instruct-iter-2-8bits (model, 1 download)
- 🤗 RichardErkhov/ZhangShenao_-_SELM-Phi-3-mini-4k-instruct-iter-3-4bits (model, 1 download)
- 🤗 RichardErkhov/ZhangShenao_-_SELM-Phi-3-mini-4k-instruct-iter-3-8bits (model, 1 download)
Taxonomy
Topics: Recommender Systems and Techniques
