On Monotonicity in AI Alignment

Gilles Bareilles; Julien Fageot; Lê-Nguyên Hoang; Peva Blanchard; Wassim Bouaziz; Sébastien Rouault; El-Mahdi El-Mhamdi

arXiv:2506.08998 · math.ST · June 17, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates the causes of non-monotonic behavior in comparison-based preference learning methods for AI alignment, providing theoretical insights and conditions to evaluate and improve their trustworthiness.

Contribution

It offers a formal analysis of monotonicity in preference learning frameworks, identifying conditions for monotonicity guarantees and clarifying limitations of current methods.

Findings

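01. Under mild assumptions, comparison-based preference learning methods satisfy local pairwise monotonicity.

02. The paper provides several formalizations of monotonicity and identifies sufficient conditions for their guarantee.

03. The analysis clarifies the limitations of current methods and guides the development of trustworthy preference learning algorithms.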

Abstract

Comparison-based preference learning has become central to the alignment of AI models with human preferences. However, these methods may behave counterintuitively. After empirically observing that, when accounting for a preference for response $y$ over $z$, the model may actually decrease the probability (and reward) of generating $y$ (an observation also made by others), this paper investigates the root causes of (non-)monotonicity, for a general comparison-based preference learning framework that subsumes Direct Preference Optimization (DPO), Generalized Preference Optimization (GPO) and Generalized Bradley-Terry (GBT). Under mild assumptions, we prove that such methods still satisfy what we call local pairwise monotonicity. We also provide a bouquet of formalizations of monotonicity, and identify sufficient conditions for their guarantee, thereby providing a toolbox to evaluate how…
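
To make the phenomenon in the abstract concrete, here is a minimal sketch that is not taken from the paper. It assumes a toy Bradley-Terry model with a linear score r(x) = theta · phi(x); the feature vectors `phi_y` and `phi_z`, the learning rate, and the helper `bt_step` are all hypothetical choices made for illustration. In this toy setting, one gradient step on the preference "y over z" widens the gap r(y) - r(z) while lowering r(y) itself, because both responses share the same parameters. This may be related to what the abstract calls local pairwise monotonicity, but the paper's formal definition is not reproduced here.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bt_step(theta, phi_y, phi_z, lr=0.1):
    """One gradient-descent step on the Bradley-Terry loss
    -log sigmoid(r(y) - r(z)) for a linear score r(x) = theta . phi(x).
    Hypothetical helper, written for this illustration only."""
    margin = dot(theta, phi_y) - dot(theta, phi_z)
    # d(loss)/d(margin) = -sigmoid(-margin), and d(margin)/d(theta) = phi_y - phi_z,
    # so descending the loss moves theta along (phi_y - phi_z).
    weight = sigmoid(-margin)
    return [t + lr * weight * (a - b) for t, a, b in zip(theta, phi_y, phi_z)]


# Assumed toy features: y and z overlap strongly, and z has the larger norm.
phi_y = [1.0, 0.0]
phi_z = [2.0, 0.0]
theta = [0.0, 0.0]

new_theta = bt_step(theta, phi_y, phi_z)

score_y_before = dot(theta, phi_y)
score_y_after = dot(new_theta, phi_y)
margin_before = dot(theta, phi_y) - dot(theta, phi_z)
margin_after = dot(new_theta, phi_y) - dot(new_theta, phi_z)

print("score of y:", score_y_before, "->", score_y_after)   # decreases
print("margin y-z:", margin_before, "->", margin_after)     # increases
```

The score of y drops here because the inner product of `phi_y` with `phi_y - phi_z` is negative. With features that do not overlap (for example, one-hot vectors), the same step would raise the score of y.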

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 2

Strengths

* The math is well done; I did not make a "deep dive" but was able to follow the math and did not catch any errors or inconsistencies. A substantial piece of this paper is dedicated to theory, so this is fairly significant.
* The paper is well-motivated; the main novelty (monotonicity taxonomy and local pairwise guarantee) is clear, and the potential impact on future work is clear.
* Overall, the paper is well-written. The structure is clear and the notation is consistent.

Weaknesses

* The motivating example is good, but the figure is small. I would like to see a zero line, larger fonts, and clearer panel labels.
* The claim of a "toolbox" (as written in the abstract) feels somewhat strong to me. The authors do formalize several forms of monotonicity, which is appreciated, but there is not even a minimal empirical example or guidance on how to use and interpret results. I recognize space constraints but a clearer link between the sections would be appreciated.

Reviewer 02 · Rating 6 · Confidence 2

Strengths

- The paper addresses a practically significant issue in AI alignment. The empirical observation that preferred responses can have decreasing scores during training is a widely known concern for its reliability. The motivating example in Figure 1 effectively demonstrates this counterintuitive behavior across multiple Llama models.
- The general formulation in Section 3 successfully unifies multiple existing methods (BT, DPO, GPO, GBT) under a common loss structure and thereby enables

Weaknesses

- All 6 models tested are from the Llama family (3.1 8B, 3.2 3B, 3.2 1B with base/instruct variants), which can raise concerns about generalizability. Testing on other architectures (e.g., Qwen) would strengthen confidence that the findings aren't specific to Llama's particular parameterization.
- Although the paper is a theoretical analysis paper, it lacks practical interpretation or experiments. It would be great if the paper mentioned what the theoretical guarantees mean for

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The authors apparently know what they want to study.

Weaknesses

The problem is, we have no idea what's the implication of their findings in terms of helping with either SFT or inference. There are no numerical or experimental results.

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Bayesian Modeling and Causal Inference · Constraint Satisfaction and Optimization