Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences
Simona-Vasilica Oprea, Adela Bâra

TL;DR
This paper introduces a feature-augmented, interpretable framework for reward modeling in language models that improves preference prediction accuracy and provides insights into decision factors and bias amplification.
Contribution
It proposes a hybrid approach combining textual signals and interpretability tools to better capture human preferences and analyze biases in language models.
Findings
ROC AUC improved from a baseline below 0.74 to as high as 0.84 across models.
Enhanced interpretability using SHAP and LIME to understand decision factors.
Bias interactions influence preference learning despite weak individual feature effects.
Abstract
Learning human preferences in language models remains fundamentally challenging, as reward modeling relies on subtle, subjective comparisons, or shades of gray, rather than clear-cut labels. This study investigates the limits of current approaches and proposes a feature-augmented framework to better capture the multidimensional nature of human judgment. Using the Anthropic HH-RLHF dataset, we evaluate ten diverse large language models (LLMs) under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task. To address this, we enrich textual representations with interpretable signals: response length, refusal indicators, toxicity scores and prompt-response semantic similarity, enabling models to explicitly capture key aspects of helpfulness, safety and relevance. The proposed hybrid approach yields consistent…
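To make the four interpretable signals named in the abstract concrete, here is a minimal sketch of how they might be computed for one response pair. It is not the authors' implementation. The refusal phrase list, the toxic-word lexicon, and the function names are hypothetical, and bag-of-words cosine similarity stands in for whatever semantic similarity measure the paper uses (a real setup would more likely use sentence embeddings and a trained toxicity classifier).

```python
import math
import re
from collections import Counter

# Hypothetical refusal phrases; the paper's actual indicator is not specified here.
REFUSAL_PATTERNS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")

# Hypothetical stand-in lexicon; a trained toxicity classifier would replace this.
TOXIC_WORDS = {"idiot", "stupid", "hate"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    """Bag-of-words cosine similarity (assumed proxy for semantic similarity)."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def response_features(prompt, response):
    """Compute the four signals for a single response."""
    tokens = tokenize(response)
    lowered = response.lower()
    return {
        "length": len(tokens),
        "refusal": int(any(p in lowered for p in REFUSAL_PATTERNS)),
        "toxicity": sum(t in TOXIC_WORDS for t in tokens) / len(tokens) if tokens else 0.0,
        "similarity": cosine_similarity(prompt, response),
    }


def pair_features(prompt, response_a, response_b):
    """Feature differences (A minus B) for a pairwise preference example."""
    fa = response_features(prompt, response_a)
    fb = response_features(prompt, response_b)
    return {name: fa[name] - fb[name] for name in fa}


prompt = "How do I bake bread?"
helpful = "Mix flour, water and yeast, then bake the bread."
refusal = "I'm sorry, I cannot help with that."
diff = pair_features(prompt, helpful, refusal)
```

In a hybrid setup of the kind the abstract describes, a difference vector like `diff` would be combined with the text representation before a classifier predicts which response was preferred. How the paper combines the two is not stated in this excerpt.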
