Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
Anirudh Bharadwaj, Chaitanya Malaviya, Nitish Joshi, Mark Yatskar

TL;DR
This paper investigates how training data biases influence preference model miscalibration in language models, revealing overreliance on superficial features and proposing a counterfactual data augmentation method to improve model reliability.
Contribution
It systematically links training data artifacts to preference model biases and introduces a simple counterfactual augmentation technique to mitigate these biases.
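The augmentation idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each training example is a dict with `prompt`, `chosen`, and `rejected` keys, that a bias-magnified rewrite of a response should be labeled as dispreferred relative to the original, and that rewriting is done by some function `magnify_bias` (hypothetical; the reviews below mention a GPT-4o-based rewriting step, for which `toy_add_fluff` is only a stand-in).

```python
from typing import Callable, Dict, List

# Assumed schema: one preference example with a prompt and two responses.
PreferenceExample = Dict[str, str]


def augment_with_counterfactuals(
    dataset: List[PreferenceExample],
    magnify_bias: Callable[[str], str],
) -> List[PreferenceExample]:
    """Return the original examples plus counterfactual pairs.

    For each example, the chosen response is rewritten so that one bias
    feature is magnified. The new pair keeps the original response as
    'chosen' and uses the rewrite as 'rejected' (an assumed labeling).
    """
    augmented = list(dataset)
    for example in dataset:
        biased = magnify_bias(example["chosen"])
        if biased == example["chosen"]:
            continue  # the rewrite changed nothing, so there is no contrast to learn from
        augmented.append(
            {
                "prompt": example["prompt"],
                "chosen": example["chosen"],
                "rejected": biased,
            }
        )
    return augmented


def toy_add_fluff(response: str) -> str:
    """Hypothetical stand-in for an LLM rewrite that magnifies verbosity."""
    return response + " To elaborate further, it is worth noting that this is indeed the case."
```

The design choice worth noting is that only one feature changes between the two responses in each added pair, so a model trained on them gets a direct signal that the feature alone does not make a response better.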
Findings
Preference models favor biased responses in over 60% of cases.
Bias features correlate weakly with human preferences but strongly with reward model labels.
Counterfactual data augmentation reduces model miscalibration and skew, improving reliability.
Abstract
Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations. However, the connection between training data artifacts and the miscalibrated preferences exhibited by models remains poorly understood. In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with artificially magnified biases (skew), finding this…
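As a rough illustration of the skew measurement described above, the sketch below computes the fraction of counterfactual pairs in which a scorer prefers the bias-magnified response. The `CounterfactualPair` type, the `skew_rate` function, and the toy scorer are assumptions made for this example; the paper's exact definition and models may differ.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class CounterfactualPair:
    prompt: str
    base: str       # original response
    perturbed: str  # rewrite with one bias feature magnified


def skew_rate(
    pairs: Sequence[CounterfactualPair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the scorer rates the perturbed response higher.

    `score(prompt, response)` stands in for a preference or reward model.
    Ties count as not favoring the perturbed response.
    """
    if not pairs:
        raise ValueError("skew_rate needs at least one pair")
    wins = sum(
        score(p.prompt, p.perturbed) > score(p.prompt, p.base) for p in pairs
    )
    return wins / len(pairs)


def toy_length_score(prompt: str, response: str) -> float:
    """Hypothetical scorer that rewards longer responses (a pure length bias)."""
    return float(len(response.split()))
```

A value near 0.5 would indicate no systematic preference for the magnified feature, while values well above 0.5 (such as the "over 60%" reported in the findings) indicate skew toward it.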
Peer Reviews
Decision: ICLR 2026 Poster
- This paper presents a comprehensive analysis of miscalibration in preference models, systematically examining five common bias features: verbosity, structure, jargon, sycophancy, and vagueness. Experimental results show that existing models exhibit substantial disagreement with human preferences.
- It quantifies how imbalances in RLHF training data amplify these bias features, revealing a clear connection between data artifacts and model miscalibration.
- The paper further introduces a simpl
- The effectiveness of the approach in real-world applications remains uncertain. While Counterfactual Data Augmentation (CDA) performs well in controlled settings, it may still struggle to generalize to diverse, dynamic environments where biases are complex and context-dependent.
- The proposed CDA method targets a limited set of bias features, e.g., verbosity, jargon, which may lack sufficient diversity or challenge. Consequently, the model might only learn to correct biases in relatively si
< Strength >
- The paper addresses a critical issue in RLHF: the overreliance of preference models on spurious surface-level features, which can lead to reward hacking and unreliable evaluation. Although some of the specific features have already been studied with ad hoc treatments, this paper provides a more systematic method.
- The use of counterfactual pairs through counterfactual data augmentation (CDA) is simple but provides a practical controlled experimental framework to isolate individual bias f
< Weakness >
- Although the observed miscalibration is significant, the paper's central claim that training data imbalances cause model miscalibration is weakened by Figure 3. For sycophancy, only 5.7% of examples show the bias in the off-diagonal, which is too small to explain the observed model behavior. For jargon, the 54.4% selection rate is barely above random chance (50%). Only structure shows a strong imbalance (65.5%), yet it has the lowest miscalibration among all LLM evaluators.
- The
1. I think this paper provides a valuable empirical contribution that sheds light on phenomena that many in the community have noticed when looking at LM outputs. I could see this encouraging further work that looks deeper at some of the qualitative insights people circulate about factors such as LLM sycophancy. To my knowledge, few prior works study these issues as comprehensively.
2. The counterfactual approach for isolating the influence of each bias dimension is tec
1. Some validation of the GPT-4o-based rewriting step would assuage concerns about its use. Even a few more concrete examples would be helpful (in addition to those in Table 1).
2. The results for the augmentation-based preference model training method seem mixed. While that in and of itself is not a problem, I think it at least warrants a little more analysis. In addition, I think the results would be better contextualized with more comparisons to existing de-biasing methods where possible.
Taxonomy
Topics: Recommender Systems and Techniques · Topic Modeling · Sentiment Analysis and Opinion Mining
