Hedging and Non-Affirmation: Quantifying LLM Alignment on Questions of Human Rights
Rafiya Javed, Cassandra Parent, Jackie Kay, David Yanni, Abdullah Zaini, Anushe Sheikh, Maribeth Rauh, Walter Gerych, Ramona Comanescu, Iason Gabriel, Marzyeh Ghassemi, Laura Weidinger

TL;DR
This paper introduces a framework to quantify and analyze hedging and non-affirmation behaviors in large language models regarding human rights across various identity groups, revealing significant identity-dependent disparities.
Contribution
It systematically measures these behaviors across multiple models and identities, and demonstrates that group steering effectively reduces biased responses.
Findings
4 out of 7 models show identity-dependent hedging behaviors
Identity is the strongest predictor of hedging and non-affirmation behaviors
Group steering effectively mitigates these behaviors across query types
Abstract
Hedging and non-affirmation are behaviors exhibited by large language models (LLMs) that limit the clear endorsement of specific statements. While these behaviors are desirable in subjective contexts, they are undesirable in the context of human rights, which apply unambiguously to all groups. We present a systematic framework to measure these behaviors in unconstrained LLM responses regarding various identity groups. We evaluate six large proprietary models as well as one open-weight LLM on 4738 prompts across 205 national and stateless ethnic identities and find that 4 out of 7 display hedging and non-affirmation that is significantly dependent on the identity of the group. While factors like conflict signals, sovereignty (whether identity is stateless), or economic indicators (GDP) also influence model behavior, their effect sizes are consistently weaker than the impact of identity…
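As a rough illustration of what "significantly dependent on the identity of the group" can mean in practice, the sketch below computes per-identity hedging rates, a chi-square statistic for independence between identity and a binary hedging label, and Cramér's V as an effect size. This is not the paper's method, which the abstract does not specify. The function names, the `(identity, hedged)` record format, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def hedging_rates(records):
    """Fraction of hedged responses per identity.

    `records` is an iterable of (identity, hedged) pairs, where `hedged`
    is a bool. This record format is hypothetical, not from the paper.
    """
    totals, hedged = Counter(), Counter()
    for identity, is_hedged in records:
        totals[identity] += 1
        hedged[identity] += int(is_hedged)
    return {g: hedged[g] / totals[g] for g in totals}


def chi_square_independence(records):
    """Chi-square statistic, degrees of freedom, and Cramér's V for the
    identity x hedged contingency table built from `records`."""
    records = list(records)
    n = len(records)
    cell = Counter(records)                      # counts per (identity, hedged)
    row = Counter(identity for identity, _ in records)
    col = Counter(flag for _, flag in records)

    chi2 = 0.0
    for g in row:
        for flag in col:
            expected = row[g] * col[flag] / n    # count expected under independence
            observed = cell[(g, flag)]
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(row) - 1) * (len(col) - 1)
    k = min(len(row), len(col)) - 1
    cramers_v = sqrt(chi2 / (n * k)) if k > 0 else 0.0
    return chi2, dof, cramers_v


# Toy data (invented): two identities with ten labeled responses each.
toy_records = (
    [("group_a", True)] * 2 + [("group_a", False)] * 8
    + [("group_b", True)] * 8 + [("group_b", False)] * 2
)

if __name__ == "__main__":
    print(hedging_rates(toy_records))
    print(chi_square_independence(toy_records))
```

A larger statistic relative to its degrees of freedom indicates stronger evidence that hedging varies with identity. Cramér's V gives a scale-free effect size, which is the kind of quantity one would compare against the effects of covariates such as conflict signals or GDP.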
