Weights-Rotated Preference Optimization for Large Language Models

Chenxu Yang; Ruipeng Jia; Mingyu Zheng; Naibin Gu; Zheng Lin; Siyuan Chen; Weichong Yin; Hua Wu; Weiping Wang

arXiv:2508.17637 · cs.CL · August 26, 2025

TL;DR

This paper introduces RoPO, a preference optimization method for large language models that reduces reward hacking by constraining both the output logits and the intermediate hidden states, improving alignment and knowledge retention with few additional trainable parameters.

Contribution

The paper proposes Weights-Rotated Preference Optimization (RoPO), a new algorithm that effectively mitigates reward hacking in LLMs by combining implicit and explicit constraints during fine-tuning.
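To give a feel for the explicit constraint, which the abstract describes as fine-tuning on an orthogonal matrix, here is a minimal sketch. It assumes a single dense rotation built with a Cayley transform; the paper's multi-granularity construction is not reproduced, and the names `cayley_rotation`, `frozen_weights`, and `trainable` are hypothetical. It only illustrates why rotating frozen weights leaves the geometry between neurons unchanged.

```python
import numpy as np


def cayley_rotation(params: np.ndarray) -> np.ndarray:
    """Map an unconstrained square matrix to an orthogonal one (Cayley transform)."""
    skew = params - params.T          # skew-symmetric part: skew.T == -skew
    eye = np.eye(params.shape[0])
    # (I + S)^{-1} (I - S) is orthogonal for any real skew-symmetric S.
    return np.linalg.solve(eye + skew, eye - skew)


rng = np.random.default_rng(0)

# Hypothetical frozen layer: 6 neurons (rows), each with 4 input weights.
frozen_weights = rng.normal(size=(6, 4))

# Only this small matrix would be trained (an assumption of this sketch).
trainable = 0.1 * rng.normal(size=(4, 4))

rotation = cayley_rotation(trainable)
rotated_weights = frozen_weights @ rotation

# Inner products between neurons (and hence norms and pairwise angles)
# are the same before and after the rotation.
gram_before = frozen_weights @ frozen_weights.T
gram_after = rotated_weights @ rotated_weights.T
```

Because the rotation is orthogonal by construction, any value of `trainable` yields weights whose neurons keep their norms and pairwise angles, which is the kind of structure-preserving update the abstract alludes to.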

Findings

01. RoPO improves AlpacaEval 2 scores by up to 3.27 points.
02. RoPO surpasses baseline MT-Bench scores by 6.2 to 7.5 points.
03. RoPO achieves these results with only 0.015% of trainable parameters.

Abstract

Despite the efficacy of Direct Preference Optimization (DPO) in aligning Large Language Models (LLMs), reward hacking remains a pivotal challenge. This issue emerges when LLMs excessively reduce the probability of rejected completions to achieve high rewards, without genuinely meeting their intended goals. This leads to overly long generations that lack diversity, as well as catastrophic forgetting of knowledge. We investigate the underlying reason behind this issue, which is representation redundancy caused by neuron collapse in the parameter space. Hence, we propose a novel Weights-Rotated Preference Optimization (RoPO) algorithm, which implicitly constrains the output layer logits with the KL divergence inherited from DPO and explicitly constrains the intermediate hidden states by fine-tuning on a multi-granularity orthogonal matrix. This design prevents the policy model…
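The reward hacking described in the abstract can be seen directly in the standard DPO objective. The sketch below assumes sequence-level log-probabilities are already available as plain floats; the function name `dpo_loss`, the `beta` value, and all the numbers are illustrative assumptions, not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(reward margin))."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the margin is zero.
baseline = dpo_loss(-10.0, -12.0, ref_chosen_logp=-10.0, ref_rejected_logp=-12.0)

# Only the rejected completion is pushed down; the chosen one is unchanged.
rejected_only = dpo_loss(-10.0, -40.0, ref_chosen_logp=-10.0, ref_rejected_logp=-12.0)

# The chosen completion becomes LESS likely, yet the loss still improves
# because the rejected completion falls much faster.
both_down = dpo_loss(-15.0, -60.0, ref_chosen_logp=-10.0, ref_rejected_logp=-12.0)
```

In this toy setting the loss keeps decreasing as long as the rejected log-probability drops faster than the chosen one, even when the chosen completion itself becomes less likely. That is the failure mode the abstract calls reward hacking.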
