TL;DR
This paper introduces MORA, a multi-objective reward method that expands reward diversity by rewriting prompts, with the goal of easing the trade-offs that arise when aligning large language models with multiple human preferences.
Contribution
MORA targets the limitation of a fixed Pareto frontier: rather than trading one preference against another along that frontier, it expands the reward dimensions through prompt rewriting to enable better multi-objective alignment.
Findings
MORA improves single-preference scores by 5% to 12.4%.
MORA achieves a 4.6% average overall reward improvement.
Extensive experiments validate MORA's effectiveness in multi-objective alignment.
Abstract
In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems…
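To make the "fixed Pareto frontier" idea concrete, the following is a minimal sketch of how one might find the non-dominated rollouts when each rollout is scored on several reward dimensions. It is not MORA's method, which the abstract does not specify at this level of detail. The function names `dominates` and `pareto_front` and the (helpfulness, harmlessness) scores are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` is at least as good as `b` on every reward
    dimension and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(rewards: List[Sequence[float]]) -> List[int]:
    """Return the indices of rollouts that no other rollout dominates."""
    return [
        i
        for i, r in enumerate(rewards)
        if not any(dominates(other, r) for j, other in enumerate(rewards) if j != i)
    ]


# Hypothetical (helpfulness, harmlessness) scores for four rollouts of one prompt.
# These numbers are invented for the example, not taken from the paper.
rollout_rewards = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5)]

front = pareto_front(rollout_rewards)
print(front)
```

In this toy data the first three rollouts trade helpfulness against harmlessness and none dominates another, while the fourth is dominated by the second. Moving along such a frontier only exchanges one reward for the other, which is the kind of compromise the abstract attributes to prior approaches.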
