Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling
Jianfeng Cai, Jinhua Zhu, Ruopei Sun, Yue Wang, Li Li, Wengang Zhou, Houqiang Li

TL;DR
This paper introduces a response-conditioned framework to mitigate length bias in preference learning for LLMs, improving alignment with human preferences and length instructions.
Contribution
The paper proposes the Rc-BT model and associated algorithms to explicitly differentiate semantic preferences from length, addressing length bias in reward modeling and policy optimization.
Findings
Effective reduction of length bias across models
Improved adherence to explicit length instructions
Enhanced alignment with human preferences
Abstract
Reinforcement Learning from Human Feedback (RLHF) has achieved considerable success in aligning large language models (LLMs) by modeling human preferences with a learnable reward model and employing a reinforcement learning algorithm to maximize the reward model's scores. However, these reward models are susceptible to exploitation through various superficial confounding factors, with length bias emerging as a particularly significant concern. Moreover, while the pronounced impact of length bias on preference modeling suggests that LLMs possess an inherent sensitivity to length perception, our preliminary investigations reveal that fine-tuned LLMs consistently struggle to adhere to explicit length instructions. To address these two limitations, we propose a novel framework wherein the reward model explicitly differentiates between human semantic preferences and response length…
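As background for the reward-modeling step the abstract describes, the following is a minimal sketch of the standard Bradley-Terry preference loss commonly used to train RLHF reward models. It is not the paper's Rc-BT objective, which this summary does not spell out. The function name `bt_preference_loss` and the scalar-reward interface are assumptions made for illustration.

```python
import math


def bt_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one
    under the Bradley-Terry model: -log(sigmoid(r_chosen - r_rejected)).

    Hypothetical helper for illustration; rewards are plain floats here,
    whereas a real reward model would produce them from (prompt, response) pairs.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the chosen response's reward exceeds the rejected one's.
loss_when_correct = bt_preference_loss(2.0, 0.0)
loss_when_wrong = bt_preference_loss(0.0, 2.0)
```

Because this loss depends only on the reward margin, a reward model can lower it by exploiting any feature correlated with the preferred response, including length. That is the failure mode the abstract refers to as length bias.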
Peer Reviews
Decision: ICLR 2026 Poster
- This is an important problem to tackle, as length bias continues to be a problem: longer responses tend to receive higher reward regardless of the content.
- The experiments documented in the appendix provide an in-depth analysis of the approach.
- The formulation of the preference pairs is a nice idea. I initially had a concern when reading the paper that optimizing to simply constrain length might cause the model to hack the constraint by doing something like generating responses in a more compressed form (e
- Only two model families are considered.
- The largest model sizes tested are on the smaller side by today's standards.
This paper is overall well written and effectively motivates the phenomenon of length bias in RLHF.
I agree that length bias is a common phenomenon in RLHF; however, I remain unclear about several points:
- What causes the length bias? Is it due to the training dataset, the scale of the LLMs, or the training procedure (e.g., lack of length regularization)? For instance, does increasing the size of the reward or policy model reduce this effect? From my experience, larger models such as GPT5 seem less likely to have this issue.
- A related weakness, I feel, is that the paper lacks a princip
1. Clear motivation and an important problem. The paper targets a widely observed and practically consequential issue in RLHF/DPO: length bias. The authors establish motivation via a “falsify first, then model” storyline, using evidence to demonstrate the problem before proposing a solution.
2. The four small studies form a coherent chain: (existence) → (evaluation bias) → (implicit ≠ explicit) → (explicit hurts semantics). This progression is logically tight and highly readable, making the diag
1. Section 3.3 (length-compliance eval), construct validity: The eval set is built from model-rewritten pairs assumed to be semantically equivalent. Any residual semantic or stylistic drift can confound the claim “implicit length bias ≠ explicit length control.” Add a simple length-only heuristic baseline (choose the candidate that satisfies the numeric constraint) as a sanity upper bound, and include human spot checks for equivalence.
2. Section 3.4 (explicitly teaching length) might result in overfitting vs.
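The length-only heuristic baseline suggested in the review above could look roughly like the sketch below. It assumes, hypothetically, that each evaluation item is a pair of responses plus a numeric "at most `max_words` words" constraint and a label marking the preferred response. The paper's actual constraint format and tokenization are not given here, so `word_count`, `length_only_choice`, and `baseline_accuracy` are illustrative names only.

```python
from typing import Optional, Sequence, Tuple


def word_count(text: str) -> int:
    """Count whitespace-separated words (an assumed length measure)."""
    return len(text.split())


def length_only_choice(response_a: str, response_b: str, max_words: int) -> Optional[int]:
    """Pick the response that satisfies the length constraint, ignoring content.

    Returns 0 or 1 for the chosen response, or None when the constraint does
    not separate the pair (both satisfy it or both violate it).
    """
    ok_a = word_count(response_a) <= max_words
    ok_b = word_count(response_b) <= max_words
    if ok_a == ok_b:
        return None
    return 0 if ok_a else 1


def baseline_accuracy(examples: Sequence[Tuple[str, str, int, int]]) -> float:
    """Accuracy of the heuristic on items where it makes a decision.

    Each example is (response_a, response_b, max_words, label), label in {0, 1}.
    Returns NaN if the heuristic abstains on every item.
    """
    decisions = [
        (length_only_choice(a, b, limit), label) for a, b, limit, label in examples
    ]
    decided = [(pred, label) for pred, label in decisions if pred is not None]
    if not decided:
        return float("nan")
    return sum(pred == label for pred, label in decided) / len(decided)


# Toy data invented for illustration only; not taken from the paper.
toy_examples = [
    ("short answer here", "a much longer answer that keeps going on", 4, 0),
    ("this one is far too long for the limit", "brief reply", 3, 1),
    ("two words", "three words here", 5, 0),  # both satisfy: heuristic abstains
]
toy_accuracy = baseline_accuracy(toy_examples)
```

If a reward model scores well below this heuristic on pairs that differ only in length compliance, that would support the claim that implicit length sensitivity does not translate into explicit length control. Residual semantic drift between the rewritten responses would still need the human spot checks the reviewer asks for.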
Taxonomy
Topics: Advanced Statistical Methods and Models · Bayesian Modeling and Causal Inference · Machine Learning and Data Classification
