Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

ShiYing Huang; Liang Lin; Yuer Li; Kaiwen Luo; Zhenhong Zhou; An Zhang; Junhao Dong; Kun Wang; Zhigang Zeng

arXiv:2605.11679·cs.AI·May 14, 2026

TL;DR

This paper introduces MORA, a multi-objective alignment method that rewrites prompts to expand the diversity of rewards, with the aim of moving past the trade-offs that arise when aligning large language models with multiple human preferences.

Contribution

Rather than trading preferences off along a fixed Pareto frontier, MORA expands the reward dimensions through prompt rewriting, which enables better multi-objective alignment.

Findings

01. MORA improves single-preference scores by 5% to 12.4%.

02. MORA achieves a 4.6% average overall reward improvement.

03. Extensive experiments validate MORA's effectiveness in multi-objective alignment.

Abstract

In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems…
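As a rough illustration of the "fixed Pareto frontier" the abstract refers to, the sketch below scores a handful of rollouts on two reward dimensions and keeps only the non-dominated ones. It is not MORA and is not taken from the paper or its repository: the score values, the (helpfulness, harmlessness) ordering, and the function names `dominates` and `pareto_front` are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical reward vector for one rollout: (helpfulness, harmlessness).
# The values used below are invented for illustration only.
Scores = Tuple[float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if a is at least as good as b everywhere and better somewhere."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(points: List[Scores]) -> List[Scores]:
    """Keep the points that no other point dominates, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Five made-up rollouts for a single prompt.
rollouts: List[Scores] = [
    (0.9, 0.2),  # very helpful, not very safe
    (0.7, 0.6),  # balanced
    (0.4, 0.8),  # safe, less helpful
    (0.5, 0.5),  # dominated by (0.7, 0.6)
    (0.3, 0.3),  # dominated by several others
]

front = pareto_front(rollouts)
print(front)
```

On this toy data every point left on the front gains helpfulness only by giving up harmlessness, or the reverse, which is the kind of compromise the abstract attributes to methods that stay on a fixed frontier.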

Code & Models

Repositories

Shiying-Huang/MORA-MPA (GitHub)