TL;DR
This paper introduces RoPO, a preference optimization method for large language models that reduces reward hacking by constraining both the output logits and the intermediate hidden states, improving alignment and knowledge retention with minimal additional parameters.
Contribution
The paper proposes Weights-Rotated Preference Optimization (RoPO), a new algorithm that effectively mitigates reward hacking in LLMs by combining implicit and explicit constraints during fine-tuning.
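To make the explicit constraint more concrete, here is a minimal NumPy sketch of fine-tuning through an orthogonal rotation of a frozen weight matrix. It is an illustration under stated assumptions, not the paper's implementation: the Cayley parameterization, the single dense rotation (rather than the multi-granularity matrix the paper describes), and the names `cayley_orthogonal`, `pairwise_cosines`, and `rotate_weight` are all hypothetical choices made for this example.

```python
import numpy as np


def cayley_orthogonal(params: np.ndarray) -> np.ndarray:
    """Map an unconstrained square matrix to an orthogonal one (Cayley transform).

    Assumption: this parameterization is chosen for illustration only.
    """
    skew = params - params.T              # skew-symmetric: skew.T == -skew
    eye = np.eye(params.shape[0])
    # (I + S) is always invertible for skew-symmetric S,
    # and Q = (I + S)^{-1} (I - S) satisfies Q.T @ Q == I.
    return np.linalg.solve(eye + skew, eye - skew)


def rotate_weight(weight: np.ndarray, params: np.ndarray) -> np.ndarray:
    """Apply the rotation to a frozen weight matrix whose columns are neurons."""
    return cayley_orthogonal(params) @ weight


def pairwise_cosines(weight: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of neurons (columns)."""
    unit = weight / np.linalg.norm(weight, axis=0, keepdims=True)
    return unit.T @ unit


rng = np.random.default_rng(0)
frozen_weight = rng.normal(size=(8, 5))    # hypothetical pretrained layer
rotation_params = rng.normal(size=(8, 8))  # the only trainable quantity here

rotated = rotate_weight(frozen_weight, rotation_params)

# The rotation changes the weights but keeps every angle between neurons.
print(np.allclose(pairwise_cosines(frozen_weight), pairwise_cosines(rotated)))
```

Because a rotation preserves the norms of the neurons and the angles between them, an update of this form cannot make neurons collapse onto one another, which is the intuition behind constraining the hidden states this way. With `rotation_params` set to zero the rotation is the identity, so training would start from the pretrained weights.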
Findings
RoPO improves AlpacaEval 2 scores by up to 3.27 points.
RoPO surpasses baseline MT-Bench scores by 6.2 to 7.5 points.
RoPO achieves these results with only 0.015% of trainable parameters.
Abstract
Despite the efficacy of Direct Preference Optimization (DPO) in aligning Large Language Models (LLMs), reward hacking remains a pivotal challenge. This issue emerges when LLMs excessively reduce the probability of rejected completions to achieve high rewards without genuinely meeting their intended goals. This leads to overly lengthy generations that lack diversity, as well as catastrophic forgetting of knowledge. We investigate the underlying cause of this issue, which is representation redundancy caused by neuron collapse in the parameter space. Hence, we propose a novel Weights-Rotated Preference Optimization (RoPO) algorithm, which implicitly constrains the output-layer logits with the KL divergence inherited from DPO and explicitly constrains the intermediate hidden states by fine-tuning on a multi-granularity orthogonal matrix. This design prevents the policy model…
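The reward-hacking behavior described in the abstract can be seen directly in the standard DPO objective. The sketch below uses the usual per-example DPO loss, -log σ(β[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]); the log-probability values and the name `dpo_loss` are made up for illustration and do not come from the paper.

```python
import numpy as np


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard per-example DPO loss from sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(margin)) written in a numerically stable form.
    return np.logaddexp(0.0, -margin)


# Hypothetical reference-model log-probabilities for one preference pair.
ref_chosen, ref_rejected = -10.0, -12.0

# Policy A: slightly raises the chosen completion, slightly lowers the rejected one.
loss_a = dpo_loss(-9.0, -13.0, ref_chosen, ref_rejected)

# Policy B: lowers BOTH completions, but the rejected one far more.
loss_b = dpo_loss(-15.0, -40.0, ref_chosen, ref_rejected)

# B gets the lower loss even though its chosen completion became less likely.
print(loss_b < loss_a)
```

The loss depends only on the gap between the two log-ratio terms, so a policy can keep lowering it by driving the rejected completion's probability toward zero while the chosen completion also becomes less likely. That is the failure mode the abstract attributes to DPO.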