TL;DR
This paper introduces DIR, an information-theoretic method to reduce complex biases in reward models for RLHF, improving alignment and generalization of large language models.
Contribution
The paper proposes a novel mutual information-based debiasing approach inspired by the information bottleneck, capable of handling non-linear biases in reward modeling.
Findings
DIR effectively mitigates biases like response length, sycophancy, and format.
DIR improves RLHF performance and generalization across benchmarks.
The method extends applicability to complex, non-linear biases.
Abstract
Reward models (RMs) are essential in reinforcement learning from human feedback (RLHF) to align large language models (LLMs) with human values. However, RM training data is commonly recognized as low-quality, containing inductive biases that can easily lead to overfitting and reward hacking. For example, more detailed and comprehensive responses are usually human-preferred but with more words, leading response length to become one of the inevitable inductive biases. A limited number of prior RM debiasing approaches either target a single specific type of bias or model the problem with only simple linear correlations, e.g., Pearson coefficients. To mitigate more complex and diverse inductive biases in reward modeling, we introduce a novel information-theoretic debiasing method called Debiasing via Information optimization for RM (DIR). Inspired by the…
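The abstract's point about linear correlation measures can be illustrated with a toy example. The sketch below is not from the paper: the `bias` values and the two synthetic reward curves are hypothetical, chosen only to show that a Pearson coefficient can be close to zero even when a reward depends deterministically, but non-linearly, on a bias attribute.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical bias attribute (e.g. a centered length feature), symmetric around zero.
bias = [x / 10 for x in range(-50, 51)]

# Synthetic reward that depends linearly on the bias attribute.
linear_reward = [2.0 * b + 1.0 for b in bias]

# Synthetic reward that depends on the bias attribute only through its square.
nonlinear_reward = [b * b for b in bias]

r_linear = pearson(bias, linear_reward)        # close to 1: linear dependence is detected
r_nonlinear = pearson(bias, nonlinear_reward)  # close to 0: the dependence is missed

print(f"linear: {r_linear:.3f}, nonlinear: {r_nonlinear:.3f}")
```

In the second case the reward is fully determined by the bias attribute, yet a penalty built on the Pearson coefficient would see almost nothing to remove. Mutual information, by contrast, is nonzero whenever two variables are statistically dependent.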
Peer Reviews
Decision: ICLR 2026 Poster
1. The explicit use of data processing inequality to justify representation-level debiasing, combined with dual variational bounds (BA for information retention, CLUB for bias suppression), provides an elegant and principled solution.
2. Experiments cover three diverse bias types with end-to-end assessment (RM performance + downstream PPO policies). Strong ablations on representation choice (Table 6) and hyperparameter $\lambda$ (Figure 4) validate design decisions.
3. Zero inference overhead
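For reference, the two variational bounds named in this review have the following standard forms. The notation is an assumption made here for illustration and is not taken from the paper: $h$ denotes the RM representation, $y$ the preference label, $b$ the bias attribute, and $q_\phi$, $q_\psi$ are variational approximations. The last line is only a guess at how the two terms might be combined with the weight $\lambda$ mentioned above.

```latex
% Barber-Agakov (BA) lower bound, valid for any variational distribution q_\phi:
I(h; y) \;\ge\; H(y) + \mathbb{E}_{p(h,y)}\big[\log q_\phi(y \mid h)\big]

% CLUB upper bound; in practice p(b | h) is replaced by a learned approximation q_\psi(b | h):
I(h; b) \;\le\; \mathbb{E}_{p(h,b)}\big[\log p(b \mid h)\big]
              - \mathbb{E}_{p(h)}\,\mathbb{E}_{p(b)}\big[\log p(b \mid h)\big]

% Assumed combined training objective (illustrative, not the paper's stated loss):
\min_{\theta,\phi}\; -\,\mathbb{E}_{p(h_\theta,y)}\big[\log q_\phi(y \mid h_\theta)\big]
  \;+\; \lambda\,\hat{I}_{\mathrm{CLUB}}(h_\theta; b)
```

Maximizing the BA bound pushes the representation to retain preference information, while minimizing the CLUB bound pushes it to carry less information about the bias attribute.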
1. Method requires knowing bias types *a priori* and labeling $b_{\mathrm{rel}}$ for every pair. No mechanism for unsupervised bias discovery limits real-world applicability.
2. Experiments isolate single biases. Real datasets likely contain concurrent biases (e.g., lengthy + sycophantic responses). Unclear how to extend DIR: multiple debiasing terms with separate $\lambda$ values? Potential optimization conflicts?
3. Sycophancy evaluation uses fixed prefix injection (“*Yes, you are right.*”). Re
1. The paper presents an intuitive yet theoretically reasonable scope of understanding reward modeling as aligning the preference distribution and preference prediction from the reward model.
2. The benchmark analysis on the biases in the reward model benchmark, RM-Bench, comes before the actual debiasing evaluation of the proposed method, which strengthens the experimental rigor of the paper.
3. Alongside the well-known length bias, the paper studies multiple types of biases and demonstrates th
The main weakness of the paper is in the clarity of writing. The clarity of mathematical notations and experimental details in the paper can be improved. Other points that could either be clarified or stated as weaknesses are listed in the questions. Overall, the clarity of Sections 2 and 3 should be improved. While there are multiple cases where the notational consistency/clarity is lacking, these are a few examples:
- Section 3.1 starts by saying that $\mathcal{L}_\text{tot
1. The paper proposes a new method and tackles well-documented biases in reward models.
2. The approach provides a structure that could extend to multiple bias types.
3. The proposed DIR method shows improvements over several baselines.
1. The paper presents itself as introducing a “novel information-theoretic framework,” but its core components are repurposed versions of existing methods. The preference loss is simply the standard Bradley-Terry ranking loss, reinterpreted post hoc as a mutual-information maximization objective. The debiasing term also relies on a conventional adversarial setup using the CLUB estimator [1], a technique already established in prior work. Although the implementation is sound and practically usefu
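The standard Bradley-Terry ranking loss that this review refers to can be sketched as follows. This is the textbook pairwise form, not code from the paper; the function names and the scalar rewards in the example are hypothetical.

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Pairwise Bradley-Terry loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() is never called on a large positive argument.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def mean_bradley_terry_loss(pairs):
    """Average the pairwise loss over (reward_chosen, reward_rejected) tuples."""
    return sum(bradley_terry_loss(c, r) for c, r in pairs) / len(pairs)

# Hypothetical scalar rewards for three preference pairs.
example_pairs = [(1.5, 0.2), (0.3, 0.3), (-0.4, 0.9)]
tie_loss = bradley_terry_loss(0.0, 0.0)       # equals log(2) when the RM is indifferent
batch_loss = mean_bradley_terry_loss(example_pairs)
print(f"tie: {tie_loss:.4f}, batch mean: {batch_loss:.4f}")
```

The loss is the negative log-likelihood of the observed preference under the model's predicted preference probability, which is the usual route to reading it as a bound on the information the reward carries about the preference label.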
