The Hidden Link Between RLHF and Contrastive Learning
Xufei Lv, Kehai Chen, Haoyuan Sun, Xuefeng Bai, Min Zhang, Houde Liu, Kehai Chen

TL;DR
This paper reveals a fundamental connection between RLHF, DPO, and contrastive learning through mutual information maximization, offering a new perspective and improved methods for aligning large language models with human values.
Contribution
It introduces a novel MI-based framework unifying RLHF and DPO, proposes the MIO method based on the Jensen-Shannon (JS) MI estimator, and demonstrates its effectiveness through theoretical and empirical analysis.
Findings
MIO mitigates late-stage decline in DPO performance
MIO achieves superior results on reasoning benchmarks
Contrastive learning perspective explains RLHF's limitations
Abstract
Alignment of large language models (LLMs) with human values has recently garnered significant attention, with prominent examples including the canonical yet costly Reinforcement Learning from Human Feedback (RLHF) and the simple Direct Preference Optimization (DPO). In this work, we demonstrate that both RLHF and DPO can be interpreted from the perspective of mutual information (MI) maximization, uncovering a profound connection to contrastive learning. Within this framework, both RLHF and DPO can be interpreted as methods that perform contrastive learning based on positive and negative samples derived from the base model, leveraging the Donsker-Varadhan (DV) lower bound on MI (equivalently, the MINE estimator). Such a paradigm further illuminates why RLHF may not intrinsically incentivize reasoning capacities in LLMs beyond what is already present in the base model. Building on the…
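The two estimators named above can be compared on toy critic scores. The following is a minimal sketch, assuming the standard textbook forms of the DV bound, E_joint[T] - log E_marginal[exp(T)], and of the JS-based objective, E_joint[-softplus(-T)] - E_marginal[softplus(T)]. The function names and the toy scores are illustrative assumptions, and this is not the paper's MIO loss.

```python
import math


def log_mean_exp(values):
    """Stable log of the mean of exp(v) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dv_lower_bound(pos_scores, neg_scores):
    """Donsker-Varadhan form: mean critic score on positive (joint) samples
    minus the log-mean-exp of critic scores on negative (marginal) samples."""
    mean_pos = sum(pos_scores) / len(pos_scores)
    return mean_pos - log_mean_exp(neg_scores)


def js_objective(pos_scores, neg_scores):
    """JS-based form: mean of -softplus(-T) on positives
    minus mean of softplus(T) on negatives."""
    term_pos = sum(-softplus(-t) for t in pos_scores) / len(pos_scores)
    term_neg = sum(softplus(t) for t in neg_scores) / len(neg_scores)
    return term_pos - term_neg


if __name__ == "__main__":
    # Hypothetical critic scores T(x, y) for chosen and rejected responses.
    pos = [2.0, 1.5, 2.5]
    neg = [-1.0, 0.0, -2.0]
    print("DV estimate:", dv_lower_bound(pos, neg))
    print("JS objective:", js_objective(pos, neg))
```

In these forms, the DV value keeps growing as the gap between positive and negative scores widens, while the JS objective saturates at 0 from below. The sketch only illustrates that difference in the formulas; it makes no claim about the paper's theorems or training dynamics.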
Peer Reviews
Decision·Submitted to ICLR 2026
- The work originally connects popular alignment algorithms (RLHF, DPO) to contrastive learning via mutual information (MI) maximization. The key insight, attributing their observed failure modes not just to the objective but to the specific MI estimator they implicitly use (the Donsker-Varadhan bound), is highly creative.
- The paper is of high quality, providing rigorous theoretical support for its claims. It introduces the "DV/MINE Starvation Theorem" to formalize why DV-based methods fail wh
The authors demonstrate MIO's strong performance in Table 1, but almost exclusively on a suite of mathematical and reasoning benchmarks. This is the exact domain the paper identifies as a key failure point for DPO. While this supports the specific claim about fixing DPO's performance degradation on these tasks, it fails to substantiate MIO as a superior general-purpose alignment method. The evaluation is missing key standard benchmarks needed to assess the full picture and potential trade-offs,
The claims in the summary above look strong and novel: the paper theoretically explains the failure of existing RLHF and DPO approaches and proposes a new method, also theoretically grounded, to address it. The experimental details look comprehensive enough to reproduce the results. The experiments cover many existing models and benchmarks. The improvements achieved by the proposed MIO method look significant in some cases.
The mathematical formulations are very unclear. Specifically, vague notation, especially $T_{\phi}$, persists throughout the paper. I list the main unclear points in the questions below.
1. The paper proposes a unifying theoretical framework: a mutual information-based perspective that unifies RLHF and DPO with contrastive learning.
2. The empirical validation uses a toy model to isolate the failure mode, large-scale fine-tuning on multiple base models, and compares against some representative baselines across several reasoning benchmarks.
3. The paper is well-structured and the authors have provided detailed experimental setups.
1. The first concern is that this work only replaces one existing MI estimator form (DV) with another existing form (JS). Moreover, to simplify the derivation and calculation, it restricts the critic to the log-ratio family and yields only an even looser *surrogate* bound (equiv. to Eq. 15). The learning scheme then optimizes this looser surrogate bound, which is a plain half-mix of the "chosen" and "rejected" information terms. It is difficult to extract a significant contribution from Eq
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling
