AMPO: Active Multi-Preference Optimization for Self-play Preference Selection
Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan

TL;DR
AMPO introduces an active subset selection method for multi-preference optimization in language models. It improves alignment by efficiently identifying diverse and informative responses for training, and reports state-of-the-art results on AlpacaEval.
Contribution
The paper presents a novel active subset selection technique for multi-preference optimization, enhancing language model alignment with theoretical guarantees and empirical improvements.
Findings
Achieves state-of-the-art results on AlpacaEval with Llama 8B and Mistral 7B.
Provides theoretical guarantees for reward maximization.
Effectively identifies diverse response modes for robust training.
Abstract
Multi-preference optimization enriches language-model alignment beyond pairwise preferences by contrasting entire sets of helpful and undesired responses, thereby enabling richer training signals for large language models. During self-play alignment, these models often produce numerous candidate answers per query, rendering it computationally infeasible to include all responses in the training objective. In this work, we propose Active Multi-Preference Optimization (AMPO), a novel approach that combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection. Specifically, we score and embed large candidate pools of responses and then select a small yet informative subset that covers reward extremes and distinct semantic clusters for preference optimization. Our contrastive training scheme is capable of identifying not only the best and…
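The selection step described in the abstract (score and embed a candidate pool, then keep a small subset that covers reward extremes and distinct semantic regions) can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `select_subset`, its parameters, and the use of greedy farthest-point selection as a stand-in for semantic clustering are all assumptions made for illustration. Rewards and embeddings are assumed to be precomputed by some external reward model and encoder.

```python
import numpy as np


def select_subset(rewards, embeddings, k):
    """Pick k candidate indices: reward extremes first, then diverse fill-ins.

    Hypothetical helper, not taken from the paper. Diversity is approximated
    with greedy farthest-point selection in embedding space.
    """
    rewards = np.asarray(rewards, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    n = len(rewards)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the pool size")

    # Reward extremes: the highest- and lowest-scored responses.
    order = np.argsort(rewards)
    chosen = [int(order[-1]), int(order[0])]

    # Distance from every candidate to its nearest already-chosen candidate.
    diffs = embeddings[:, None, :] - embeddings[chosen][None, :, :]
    dists = np.linalg.norm(diffs, axis=-1).min(axis=1)

    while len(chosen) < k:
        dists[chosen] = -np.inf  # never re-pick a selected candidate
        nxt = int(np.argmax(dists))  # farthest from everything chosen so far
        chosen.append(nxt)
        new_d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        dists = np.minimum(dists, new_d)
    return chosen


# Toy pool: five responses with scalar rewards and 2-D embeddings.
rewards = [0.1, 0.9, 0.5, 0.4, 0.45]
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.5, 0.0], [0.5, 5.0], [0.6, 0.1]]
subset = select_subset(rewards, embeddings, k=3)
```

In this toy pool the first two picks are the best- and worst-rewarded responses, and the third is the response whose embedding lies farthest from both, which is the kind of coverage the abstract describes. A clustering-based variant would replace the greedy loop with a per-cluster pick.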
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization
Methods: LLaMA
