Sample Efficient Preference Alignment in LLMs via Active Exploration
Viraj Mehta, Syrine Belakaria, Vikramjeet Das, Ojash Neopane, Yijia Dai, Ilija Bogunovic, Barbara Engelhardt, Stefano Ermon, Jeff Schneider, and Willie Neiswanger

TL;DR
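This paper introduces an active exploration method for preference alignment in large language models. It reduces the cost of human feedback by choosing the contexts at which feedback is requested, and it frames the problem as an active contextual dueling bandit with a polynomial worst-case regret bound.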
Contribution
The paper formalizes preference alignment as an active contextual dueling bandit problem, proposes an active exploration algorithm with a polynomial worst-case regret bound, and extends the setting and methodology for practical use in preference alignment of LLMs.
Findings
Outperforms baselines when only a limited number of human preference samples is available
Effective on multiple language models and datasets
Contributes two new real-world datasets
Abstract
Preference-based feedback is important for many applications in machine learning where evaluation of a reward function is not feasible. Notable recent examples arise in preference alignment for large language models, including in reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). For many applications of preference alignment, the cost of acquiring human feedback can be substantial. In this work, we take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy, and formalize the setting as an active contextual dueling bandit problem. We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a polynomial worst-case regret bound. We extend the setting and methodology for practical use in preference alignment of large…
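To make the idea of choosing contexts for feedback concrete, here is a minimal tabular sketch of an active contextual dueling loop. It is not the paper's algorithm. The class `ActiveDuelingSketch`, the `bonus` parameter, the count-based confidence width, and the simulated Bradley-Terry preference oracle in `simulate` are all illustrative assumptions. The learner queries the context where the best action is least certain, then duels the most optimistic action against a randomly chosen other action.

```python
import math
import random


class ActiveDuelingSketch:
    """Toy tabular learner (hypothetical): tracks duel win rates per (context, action)."""

    def __init__(self, n_contexts, n_actions, bonus=1.0):
        self.n_contexts = n_contexts
        self.n_actions = n_actions  # assumed to be at least 2
        self.bonus = bonus  # assumed exploration scale, not taken from the paper
        self.wins = [[0] * n_actions for _ in range(n_contexts)]
        self.counts = [[0] * n_actions for _ in range(n_contexts)]

    def mean(self, c, a):
        # Empirical win rate of action a in context c; 0.5 before any duel.
        n = self.counts[c][a]
        return self.wins[c][a] / n if n else 0.5

    def width(self, c, a):
        # Simple count-based confidence width that shrinks with more duels.
        return self.bonus / math.sqrt(self.counts[c][a] + 1)

    def gap(self, c):
        # Uncertainty about the best action in context c:
        # highest upper bound minus highest lower bound.
        acts = range(self.n_actions)
        best_upper = max(self.mean(c, a) + self.width(c, a) for a in acts)
        best_lower = max(self.mean(c, a) - self.width(c, a) for a in acts)
        return best_upper - best_lower

    def select_query(self, rng):
        # Actively pick the most uncertain context, then duel the most
        # optimistic action against a randomly chosen different action.
        c = max(range(self.n_contexts), key=self.gap)
        a = max(range(self.n_actions),
                key=lambda x: self.mean(c, x) + self.width(c, x))
        b = rng.choice([x for x in range(self.n_actions) if x != a])
        return c, a, b

    def update(self, c, a, b, a_won):
        # Record one duel outcome for both participants.
        self.counts[c][a] += 1
        self.counts[c][b] += 1
        self.wins[c][a if a_won else b] += 1

    def policy(self, c):
        # Greedy policy: action with the highest empirical win rate.
        return max(range(self.n_actions), key=lambda a: self.mean(c, a))


def simulate(rewards, budget, seed=0):
    """Run the loop against a simulated Bradley-Terry oracle (an assumption)."""
    rng = random.Random(seed)
    learner = ActiveDuelingSketch(len(rewards), len(rewards[0]))
    for _ in range(budget):
        c, a, b = learner.select_query(rng)
        # Probability that a is preferred to b under hidden rewards.
        p_a_wins = 1.0 / (1.0 + math.exp(-(rewards[c][a] - rewards[c][b])))
        learner.update(c, a, b, rng.random() < p_a_wins)
    return learner
```

Calling `simulate([[0.0, 2.0], [1.0, -1.0]], budget=50)` runs 50 simulated preference queries and returns a learner whose `policy(c)` gives the greedy action for each context. In an LLM setting the contexts would be prompts and the actions candidate completions, which needs function approximation rather than a table.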
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Machine Learning and Algorithms
