Reward-free Alignment for Conflicting Objectives
Peter L. Chen, Xiaopeng Li, Xi Chen, Tianyi Lin

TL;DR
This paper introduces RACO, a reward-free framework for aligning large language models under conflicting objectives. It learns directly from pairwise preference data and resolves gradient conflicts with a clipped variant of conflict-averse gradient descent, aiming for better trade-offs and convergence.
Contribution
The paper proposes a reward-free multi-objective alignment method with convergence guarantees to Pareto-critical points, together with practical heuristics, and reports improved Pareto trade-offs in LLM alignment tasks.
Findings
RACO achieves better Pareto trade-offs than existing methods.
Clipping improves the convergence rate in the two-objective case.
The method is effective across multiple LLM families and tasks.
Abstract
Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points…
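The gradient conflict described in the abstract can be illustrated with a small sketch. It does not reproduce RACO or its clipped conflict-averse gradient descent, whose details are not given on this page; it swaps in the simpler two-objective min-norm rule (MGDA-style) to show why a fixed weighted sum can hurt one objective while a conflict-aware direction improves both. The gradient values, the 0.8/0.2 weighting, and all function names are hypothetical.

```python
# Illustrative only: two conflicting gradients, a fixed weighted sum, and a
# min-norm (MGDA-style) common descent direction. Not the paper's algorithm.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def combine(w, g1, g2):
    # Convex combination w * g1 + (1 - w) * g2.
    return [w * a + (1.0 - w) * b for a, b in zip(g1, g2)]

def min_norm_weight(g1, g2):
    # Closed-form minimizer of ||w * g1 + (1 - w) * g2||^2 over w in [0, 1].
    diff = [b - a for a, b in zip(g1, g2)]  # g2 - g1
    denom = dot(diff, diff)
    if denom == 0.0:
        return 0.5  # identical gradients: any weight gives the same vector
    w = dot(diff, g2) / denom
    return min(1.0, max(0.0, w))

# Hypothetical per-objective loss gradients that point in conflicting directions.
g1 = [1.0, 0.0]
g2 = [-0.8, 0.6]

# A descent step along -d lowers loss i (to first order) only if dot(g_i, d) > 0.
weighted = combine(0.8, g1, g2)   # fixed user weights, assumed for illustration
w_star = min_norm_weight(g1, g2)
common = combine(w_star, g1, g2)  # min-norm direction

print("weighted sum:", dot(weighted, g1), dot(weighted, g2))
print("min-norm    :", dot(common, g1), dot(common, g2))
```

With these values the weighted sum has a negative inner product with the second gradient, so a step along it raises the second loss, while the min-norm direction has a positive inner product with both gradients.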
Taxonomy
Topics: Topic Modeling · Constraint Satisfaction and Optimization · Recommender Systems and Techniques
