Misaligned by Reward: Socially Undesirable Preferences in LLMs
Gayane Ghazaryan, Esra Dönmez

TL;DR
This paper extends reward-model benchmarking to social domains, revealing that current models often prefer socially undesirable responses and exhibit biased preferences, highlighting gaps in social alignment evaluation.
Contribution
The paper introduces a framework that converts social evaluation datasets into pairwise preference data, enabling assessment of reward models' social preferences across four domains: bias, safety, morality, and ethical reasoning.
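As a rough illustration of the conversion step, the sketch below turns one multiple-choice item with a gold label into pairwise comparisons. The item layout (`prompt`, `options`, `gold_index`), the `PreferencePair` type, and the example text are assumptions made for illustration, not the paper's schema or data. It covers only the gold-label case and does not attempt the directional-bias-indicator case.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PreferencePair:
    """One pairwise comparison (hypothetical type, not from the paper)."""
    prompt: str
    chosen: str    # response treated as socially desirable (the gold label)
    rejected: str  # response treated as socially undesirable


def to_preference_pairs(item: dict) -> List[PreferencePair]:
    """Pair the gold option against every other option of one item.

    Assumes `item` has keys "prompt", "options" (list of str) and
    "gold_index" (int); this layout is an illustrative assumption.
    """
    gold_index = item["gold_index"]
    gold = item["options"][gold_index]
    return [
        PreferencePair(item["prompt"], gold, other)
        for i, other in enumerate(item["options"])
        if i != gold_index
    ]


# Made-up example item, for illustration only.
example = {
    "prompt": "A coworker asks for help covering up a safety violation.",
    "options": [
        "Decline and suggest reporting it.",
        "Help hide the evidence.",
        "Ignore the request and delete the logs.",
    ],
    "gold_index": 0,
}
pairs = to_preference_pairs(example)
```

Each item with k options yields k - 1 pairs, all sharing the same prompt and the same chosen response.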
Findings
Reward models often prefer socially undesirable responses.
Their preferences produce systematically biased distributions over selected outputs.
Stronger bias avoidance can reduce contextual sensitivity.
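One simple way to quantify the first finding is the fraction of pairs in which a reward model scores the undesirable response above the desirable one. The sketch below assumes pairs are `(prompt, chosen, rejected)` tuples and that the reward model is exposed as a `score(prompt, response)` function; both are hypothetical, the toy scorer is a stand-in rather than a real reward model, and this is not necessarily the metric the paper reports.

```python
from typing import Callable, Iterable, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected); assumed layout


def undesirable_preference_rate(
    pairs: Iterable[Pair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the rejected response gets the higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    wrong = sum(
        1
        for prompt, chosen, rejected in pairs
        if score(prompt, rejected) > score(prompt, chosen)
    )
    return wrong / len(pairs)


def toy_score(prompt: str, response: str) -> float:
    """Stand-in scorer that simply rewards longer responses (assumption)."""
    return float(len(response))


toy_pairs = [
    # The longer, undesirable reply wins under the toy scorer.
    ("p1", "No.", "Sure, here is how to do it."),
    # The longer, desirable reply wins under the toy scorer.
    ("p2", "I would rather not help with that.", "Fine."),
]
rate = undesirable_preference_rate(toy_pairs, toy_score)
```

Ties count as correct here because the comparison is strict; a stricter evaluation could count them as errors instead.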
Abstract
Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following benchmarks, providing limited insight into whether these models capture socially desirable preferences. As a result, important failures in social alignment can remain hidden. We extend reward-model benchmarking to four socially consequential domains: bias, safety, morality, and ethical reasoning. We introduce a framework that converts social evaluation datasets into pairwise preference data, leveraging gold labels where available and directional bias indicators otherwise. This enables us to test whether reward models prefer socially undesirable responses, and whether their preferences produce systematically biased distributions over selected outputs. Across five publicly available…
