ComPO: Preference Alignment via Comparison Oracles
Peter Chen, Xi Chen, Wotao Yin, Tianyi Lin

TL;DR
This paper introduces ComPO, a novel preference alignment method using comparison oracles that effectively handles noisy preference pairs, improving LLM alignment with human preferences through a zeroth-order optimization approach.
Contribution
Proposes a new comparison-based preference alignment method with convergence guarantees and demonstrates its effectiveness on multiple models and benchmarks.
Findings
Effective in improving LLM performance with noisy preference data
Outperforms existing direct alignment methods in experiments
Highlights the need for methods tailored to preference pairs with different likelihood margins
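As a rough illustration of what a "likelihood margin" refers to here, the sketch below computes the gap between the log-likelihoods of the preferred and dispreferred responses and separates small-margin pairs from large-margin ones. The field names, the `threshold` value, the use of the absolute margin, and the toy numbers are all assumptions made for this example. They are not taken from the paper.

```python
def likelihood_margin(logp_preferred, logp_dispreferred):
    """Log-likelihood gap between the preferred and dispreferred response."""
    return logp_preferred - logp_dispreferred


def split_by_margin(pairs, threshold):
    """Split pairs into (small_margin, large_margin) lists.

    A pair counts as small-margin when the two responses have similar
    likelihood, i.e. |margin| < threshold. The threshold is a hypothetical
    knob chosen for this sketch.
    """
    small_margin, large_margin = [], []
    for pair in pairs:
        margin = likelihood_margin(pair["logp_preferred"], pair["logp_dispreferred"])
        if abs(margin) < threshold:
            small_margin.append(pair)
        else:
            large_margin.append(pair)
    return small_margin, large_margin


# Toy, made-up log-likelihoods purely for illustration.
pairs = [
    {"id": "a", "logp_preferred": -12.0, "logp_dispreferred": -12.3},
    {"id": "b", "logp_preferred": -8.0, "logp_dispreferred": -15.0},
    {"id": "c", "logp_preferred": -20.5, "logp_dispreferred": -20.1},
]
small_margin, large_margin = split_by_margin(pairs, threshold=1.0)
```

In a setup like this, the small-margin group would be the natural candidate for a method designed for noisy pairs, while the large-margin group could be left to a standard direct alignment objective.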
Abstract
Direct alignment methods are increasingly used for aligning large language models (LLMs) with human preferences. However, these methods suffer from verbosity and likelihood displacement, which can be driven by noisy preference pairs that induce similar likelihood for preferred and dispreferred responses. The contributions of this paper are two-fold. First, we propose a new preference alignment method based on zeroth-order, comparison-based optimization via comparison oracles and provide convergence guarantees for its basic scheme. Second, we improve our method using heuristics and conduct experiments to demonstrate the flexibility and compatibility of the practical scheme in improving the performance of LLMs using noisy preference pairs. Evaluations are conducted across multiple base and instruction-tuned models (Mistral-7B, Llama-3-8B and Gemma-2-9B) with…
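To make the idea of zeroth-order, comparison-based optimization more concrete, here is a minimal generic sketch on a toy quadratic. It is not the ComPO algorithm itself. It assumes an oracle that only reports which of two points has the larger objective value, estimates a descent direction by averaging sign-weighted random directions, and takes normalized steps. The function names, step size, query count, probe radius, and toy objective are all illustrative assumptions.

```python
import math
import random


def comparison_oracle(f, x, y):
    """Return +1 if f(y) > f(x), else -1. Only the ordering is revealed."""
    return 1 if f(y) > f(x) else -1


def random_unit_vector(dim, rng):
    """Sample a direction uniformly from the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def estimate_direction(f, x, num_queries, radius, rng):
    """Estimate the normalized ascent direction from comparisons alone."""
    dim = len(x)
    g = [0.0] * dim
    for _ in range(num_queries):
        u = random_unit_vector(dim, rng)
        probe = [xi + radius * ui for xi, ui in zip(x, u)]
        s = comparison_oracle(f, x, probe)  # +1 if the probe is worse
        for j in range(dim):
            g[j] += s * u[j]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g] if norm > 0 else g


def minimize(f, x0, steps=200, step_size=0.05, num_queries=20,
             radius=1e-3, seed=0):
    """Normalized descent that never reads function values or gradients."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        d = estimate_direction(f, x, num_queries, radius, rng)
        x = [xi - step_size * di for xi, di in zip(x, d)]
    return x


def toy_loss(x):
    """Toy objective with its minimum at (1, 1, 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


x_start = [3.0, -2.0, 0.5]
x_final = minimize(toy_loss, x_start)
```

The optimizer only ever calls `comparison_oracle`, so it would work unchanged if the numeric loss were replaced by any procedure that can rank two candidates. With a fixed step size the iterate hovers near the minimum rather than converging exactly; a decaying step size is one common way to tighten that.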
Taxonomy
Topics: Topic Modeling · Recommender Systems and Techniques · Constraint Satisfaction and Optimization
Methods: Balanced Selection
