A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

Hui Yuan; Yifan Zeng; Yue Wu; Huazheng Wang; Mengdi Wang; Liu Leqi

arXiv:2410.13828·cs.LG·April 23, 2025

TL;DR

This paper identifies a pitfall of margin-based language model alignment methods, gradient entanglement, which can cause unintended safety and performance failures, and offers theoretical and empirical insights toward addressing it.

Contribution

The paper identifies gradient entanglement as a key issue in margin-based LM alignment, provides theoretical conditions under which it becomes problematic, validates them empirically, and suggests algorithmic improvements.

Findings

01

Margin-based losses couple preferred and dispreferred response probabilities.

02

Gradient entanglement can cause safety and performance failures.

03

The paper gives theoretical conditions for when gradient entanglement becomes problematic.
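The coupling described in the findings above can be illustrated with a small first-order sketch. It is not taken from the paper or its repository. It assumes a generic margin loss of the form -log(sigmoid(m)) with m = log p(preferred) - log p(dispreferred), and it treats `g_w` and `g_l` as hypothetical gradient vectors of the two log-probabilities with respect to the model parameters. Under those assumptions, one gradient step moves the parameters along `g_w - g_l`, so the change in each log-probability depends on the inner product between the two gradients.

```python
import numpy as np

def first_order_changes(g_w, g_l, margin=0.0, lr=0.1):
    """First-order change in log-probs after one descent step on -log(sigmoid(margin)).

    g_w, g_l: hypothetical gradients of log p(preferred) and log p(dispreferred)
              with respect to the parameters (illustrative, not from the paper).
    Returns (change in preferred log-prob, change in dispreferred log-prob,
             change in margin).
    """
    g_w = np.asarray(g_w, dtype=float)
    g_l = np.asarray(g_l, dtype=float)
    # Gradient of the loss is -sigmoid(-margin) * (g_w - g_l); descend along its negative.
    scale = lr / (1.0 + np.exp(margin))  # lr * sigmoid(-margin)
    step = scale * (g_w - g_l)
    d_w = float(g_w @ step)  # scale * (|g_w|^2 - <g_w, g_l>)
    d_l = float(g_l @ step)  # scale * (<g_w, g_l> - |g_l|^2)
    return d_w, d_l, d_w - d_l

# Case A: <g_w, g_l> exceeds |g_w|^2, so the preferred log-prob drops.
print(first_order_changes([1.0, 0.0], [2.0, 1.0]))
# Case B: <g_w, g_l> exceeds |g_l|^2, so the dispreferred log-prob rises.
print(first_order_changes([2.0, 1.0], [1.0, 0.0]))
```

In both cases the margin change equals `scale * |g_w - g_l|^2`, which is never negative, so the margin grows even while one of the individual log-probabilities moves in the unintended direction. This holds only to first order in the step size and only for the assumed loss form.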

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become the predominant approach for language model (LM) alignment. At its core, RLHF uses a margin-based loss for preference optimization, specifying ideal LM behavior only by the difference between preferred and dispreferred responses. In this paper, we identify a common pitfall of margin-based methods -- the under-specification of ideal LM behavior on preferred and dispreferred responses individually, which leads to two unintended consequences as the margin increases: (1) The probability of dispreferred (e.g., unsafe) responses may increase, resulting in potential safety alignment failures. (2) The probability of preferred responses may decrease, even when those responses are ideal. We demystify the reasons behind these problematic behaviors: margin-based losses couple the change in the preferred probability to the gradient of the…

Code & Models

Repositories

humainlab/understand_marginpo
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling