Preference-Guided Reflective Sampling for Aligning Language Models
Hai Ye, Hwee Tou Ng

TL;DR
This paper introduces Preference-Guided Reflective Sampling (PRS), a novel sampling method that improves the alignment of large language models to human preferences through adaptive self-refinement and natural language preference specification.
Contribution
PRS offers a tree-based, adaptive sampling framework that enhances response quality and preference alignment in large language models, outperforming traditional random sampling methods.
Findings
PRS generates higher-quality responses that receive higher rewards.
PRS outperforms repeated random sampling in best-of-N scenarios.
PRS shows strong performance in iterative offline RL training.
Abstract
Iterative data generation and model re-training can effectively align large language models (LLMs) to human preferences. The process of data sampling is crucial, as it significantly influences the success of policy improvement. Repeated random sampling is a widely used method that independently queries the model multiple times to generate outputs. In this work, we propose a more effective sampling method, named Preference-Guided Reflective Sampling (PRS). Unlike random sampling, PRS employs a tree-based generation framework to enable more efficient sampling. It leverages adaptive self-refinement techniques to better explore the sampling space. By specifying user preferences in natural language, PRS can further optimize response generation according to these preferences. As a result, PRS can align models to diverse user preferences. Our experiments demonstrate that PRS generates…
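As a rough illustration of the contrast the abstract draws between repeated random sampling and a tree-based, preference-conditioned refinement step, the following Python sketch uses toy stand-ins. Everything in it is an assumption made for illustration: the function names (`best_of_n`, `reflective_sample`, `toy_generate`, `toy_refine`, `toy_reward`), the two-layer structure, the even split of the sampling budget, and the way the preference string is appended to the prompt. None of these details are taken from the paper, and a real setup would call a language model and a reward model in place of the toy functions.

```python
import random
from typing import Callable, List

# Hypothetical interfaces, not from the paper.
Generator = Callable[[str, random.Random], str]
Refiner = Callable[[str, str, random.Random], str]
Reward = Callable[[str, str], float]


def best_of_n(prompt: str, generate: Generator, reward: Reward,
              n: int, rng: random.Random) -> str:
    """Baseline: draw n independent samples and keep the highest-reward one."""
    candidates: List[str] = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))


def reflective_sample(prompt: str, preference: str, generate: Generator,
                      refine: Refiner, reward: Reward,
                      n: int, rng: random.Random) -> str:
    """Two-layer sketch: sample, pick the best draft, refine it, keep the best overall.

    Assumption: half the budget goes to the first layer and the rest to refinements.
    """
    if n < 2:
        raise ValueError("need a budget of at least 2 samples")
    # Assumption: the preference is stated in natural language and appended to the prompt.
    conditioned = f"{prompt}\n[preference] {preference}"
    n_first = n // 2
    first = [generate(conditioned, rng) for _ in range(n_first)]
    best_draft = max(first, key=lambda y: reward(conditioned, y))
    # Second layer: new candidates are conditioned on the best first-layer draft.
    second = [refine(conditioned, best_draft, rng) for _ in range(n - n_first)]
    return max(first + second, key=lambda y: reward(conditioned, y))


# Toy stand-ins so the sketch runs without any model.
def toy_generate(prompt: str, rng: random.Random) -> str:
    """Pretend policy: an 8-character string over {'a', 'b'} (ignores the prompt)."""
    return "".join(rng.choice("ab") for _ in range(8))


def toy_refine(prompt: str, draft: str, rng: random.Random) -> str:
    """Pretend refinement: turn one randomly chosen 'b' into 'a', if any 'b' exists."""
    positions = [i for i, ch in enumerate(draft) if ch == "b"]
    if not positions:
        return draft
    i = rng.choice(positions)
    return draft[:i] + "a" + draft[i + 1:]


def toy_reward(prompt: str, response: str) -> float:
    """Pretend reward model: the number of 'a' characters in the response."""
    return float(response.count("a"))


if __name__ == "__main__":
    rng = random.Random(0)
    print(best_of_n("Explain PRS.", toy_generate, toy_reward, 8, rng))
    print(reflective_sample("Explain PRS.", "Be concise.",
                            toy_generate, toy_refine, toy_reward, 8, rng))
```

In this toy setting the refinement step can never lower the reward of the draft it starts from, so the second layer only helps. With a real model that guarantee does not hold, which is why the sketch takes the maximum over both layers instead of returning a refinement directly.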
Taxonomy
Topics: Topic Modeling
Methods: ALIGN
