REFA: Reference Free Alignment for multi-preference optimization

Taneesh Gupta; Rahul Madhavan; Xuchao Zhang; Chetan Bansal; Saravan Rajmohan

arXiv:2412.16378 · cs.LG · November 6, 2025

TL;DR

REFA introduces a novel token-level regularization method for alignment that prevents shortcut solutions caused by length normalization, leading to genuine quality improvements and better control over response length.

Contribution

The paper proposes REFA, a new alignment framework using EOS token regularization to address the URSLA shortcut and improve multi-preference optimization.

Findings

01. REFA achieves a 60.29% win rate on AlpacaEval2.

02. REFA reaches a 52.17% length-controlled win rate, which suggests its gains do not come from verbosity alone.

03. Token-level regularization of the EOS probability improves alignment quality over existing methods.

Abstract

To mitigate reward hacking from response verbosity, modern preference optimization methods are increasingly adopting length normalization (e.g., SimPO, ORPO, LN-DPO). While effective against this bias, we demonstrate that length normalization itself introduces a failure mode: the URSLA shortcut. Here, models learn to satisfy the alignment objective by prematurely truncating low-quality responses rather than learning from their semantic content. To address this, we introduce REFA, a new alignment framework that applies probabilistic control to the structural token governing termination. Our core innovation is a new class of regularizers that operate directly on the probability of the End-of-Sequence (EOS) token, a previously unexploited control lever. This token-level intervention provides a principled solution to the URSLA shortcut, ensuring genuine quality improvements. Furthermore,…
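The two ingredients the abstract names, length normalization and a regularizer on the EOS probability, can be illustrated with a small sketch. This is not the paper's formulation: the function names, the `weight` parameter, and the specific form of the penalty (mean EOS probability at positions before the intended end of the response) are assumptions chosen only to show the idea.

```python
def summed_logprob(token_logprobs):
    """Unnormalized sequence score: the sum of per-token log-probabilities."""
    return sum(token_logprobs)


def length_normalized_logprob(token_logprobs):
    """Length-normalized score: the average per-token log-probability."""
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def early_eos_penalty(eos_probs_before_end, weight=0.1):
    """Hypothetical regularizer (an assumption, not REFA's actual term).

    Takes the probability the model assigns to EOS at each position before
    the intended end of the response and returns a penalty that grows as
    the model puts more mass on stopping early.
    """
    if not eos_probs_before_end:
        return 0.0
    return weight * sum(eos_probs_before_end) / len(eos_probs_before_end)


if __name__ == "__main__":
    # Two responses with identical per-token quality but different lengths.
    short = [-0.5] * 4
    long = [-0.5] * 16

    # The summed score favors the shorter response ...
    print(summed_logprob(short), summed_logprob(long))
    # ... while the normalized score treats both the same.
    print(length_normalized_logprob(short), length_normalized_logprob(long))

    # A model that leans toward stopping early incurs a larger penalty.
    print(early_eos_penalty([0.5, 0.1], weight=1.0))
```

Because the normalized score depends on length only through the average, an objective built on it can be moved by changing where a response ends instead of what it says. A term tied directly to the EOS probability gives the training objective a handle on that behavior.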

Taxonomy

Topics: Multi-Criteria Decision Making

Methods: Balanced Selection