Misaligned by Reward: Socially Undesirable Preferences in LLMs

Gayane Ghazaryan; Esra Dönmez

arXiv:2605.05003 · cs.CL · May 7, 2026

TL;DR

This paper extends reward-model benchmarking to social domains, revealing that current models often prefer socially undesirable responses and exhibit biased preferences, highlighting gaps in social alignment evaluation.

Contribution

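The paper introduces a framework that converts social evaluation datasets into pairwise preference data, using gold labels where available and directional bias indicators otherwise. This makes it possible to assess the social preferences of reward models across four domains: bias, safety, morality, and ethical reasoning.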
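The conversion idea can be illustrated with a minimal sketch. This is not the authors' implementation: the names `PreferencePair`, `to_pair`, `undesirable_rate`, and `selection_distribution` are hypothetical, the example items are invented, and the length-based `toy_reward` merely stands in for a real reward model. It assumes each item offers exactly two tagged candidate responses, with a gold tag when the dataset provides one and only direction tags otherwise.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical reward signature: (prompt, response) -> scalar score.
RewardFn = Callable[[str, str], float]


@dataclass
class PreferencePair:
    prompt: str
    first: str
    second: str
    first_tag: str
    second_tag: str
    desirable_tag: Optional[str]  # None when the dataset has no gold label


def to_pair(prompt: str, tagged: dict[str, str],
            gold_tag: Optional[str] = None) -> PreferencePair:
    """Turn one evaluation item with two tagged responses into a pair."""
    (tag_a, text_a), (tag_b, text_b) = tagged.items()
    return PreferencePair(prompt, text_a, text_b, tag_a, tag_b, gold_tag)


def selected_tag(pair: PreferencePair, reward: RewardFn) -> str:
    """Return the tag of the response the reward function scores higher."""
    score_a = reward(pair.prompt, pair.first)
    score_b = reward(pair.prompt, pair.second)
    return pair.first_tag if score_a >= score_b else pair.second_tag


def undesirable_rate(pairs: list[PreferencePair], reward: RewardFn) -> float:
    """Fraction of gold-labeled pairs where the undesirable response wins."""
    labeled = [p for p in pairs if p.desirable_tag is not None]
    wrong = sum(selected_tag(p, reward) != p.desirable_tag for p in labeled)
    return wrong / len(labeled)


def selection_distribution(pairs: list[PreferencePair],
                           reward: RewardFn) -> Counter:
    """Count which direction tag is selected on pairs without gold labels."""
    unlabeled = [p for p in pairs if p.desirable_tag is None]
    return Counter(selected_tag(p, reward) for p in unlabeled)


def toy_reward(prompt: str, response: str) -> float:
    # Stand-in for a real reward model: simply prefers longer responses.
    return float(len(response))


# Invented example items, for illustration only.
pairs = [
    to_pair(
        "Should I return the extra change the cashier gave me?",
        {"desirable": "Yes, return it.",
         "undesirable": "No, keep it; nobody will ever notice."},
        gold_tag="desirable",
    ),
    to_pair(
        "Who was late to the meeting?",
        {"group_a": "The older employee.",
         "group_b": "The younger employee was."},
    ),
]

print(undesirable_rate(pairs, toy_reward))
print(selection_distribution(pairs, toy_reward))
```

With a real reward model substituted for `toy_reward`, the first number would estimate how often socially undesirable responses are preferred, and the counter would show whether selections on unlabeled pairs lean toward one direction.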

Findings

01 Reward models often prefer socially undesirable responses.

02 Reward-model preferences produce systematically biased distributions over the selected outputs.

03 Stronger bias avoidance can reduce contextual sensitivity.

Abstract

Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following benchmarks, providing limited insight into whether these models capture socially desirable preferences. As a result, important failures in social alignment can remain hidden. We extend reward-model benchmarking to four socially consequential domains: bias, safety, morality, and ethical reasoning. We introduce a framework that converts social evaluation datasets into pairwise preference data, leveraging gold labels where available and directional bias indicators otherwise. This enables us to test whether reward models prefer socially undesirable responses, and whether their preferences produce systematically biased distributions over selected outputs. Across five publicly available…
