Hedging and Non-Affirmation: Quantifying LLM Alignment on Questions of Human Rights

Rafiya Javed; Cassandra Parent; Jackie Kay; David Yanni; Abdullah Zaini; Anushe Sheikh; Maribeth Rauh; Walter Gerych; Ramona Comanescu; Iason Gabriel; Marzyeh Ghassemi; Laura Weidinger

arXiv:2502.19463 · cs.CY · April 8, 2026

TL;DR

This paper introduces a framework to quantify and analyze hedging and non-affirmation behaviors in large language models regarding human rights across various identity groups, revealing significant identity-dependent disparities.

Contribution

The paper systematically measures hedging and non-affirmation across seven models and 205 national and stateless ethnic identities, and demonstrates that group steering effectively reduces biased responses.

Findings

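01. Four of the seven models show hedging and non-affirmation that depend significantly on the identity of the group.

02. Identity is the strongest predictor of hedging and non-affirmation; conflict signals, sovereignty, and GDP have consistently weaker effects.

03. Group steering effectively mitigates these behaviors across query types.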

Abstract

Hedging and non-affirmation are behaviors exhibited by large language models (LLMs) that limit the clear endorsement of specific statements. While these behaviors are desirable in subjective contexts, they are undesirable in the context of human rights, which apply unambiguously to all groups. We present a systematic framework to measure these behaviors in unconstrained LLM responses regarding various identity groups. We evaluate six large proprietary models as well as one open-weight LLM on 4738 prompts across 205 national and stateless ethnic identities and find that 4 out of 7 models display hedging and non-affirmation that is significantly dependent on the identity of the group. While factors like conflict signals, sovereignty (whether the identity is stateless), or economic indicators (GDP) also influence model behavior, their effect sizes are consistently weaker than the impact of identity…
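To make "significantly dependent on the identity of the group" concrete, the sketch below shows one common way such a dependence could be quantified. This summary does not state which statistical test or effect-size measure the paper uses, so the chi-square statistic and Cramér's V here are assumptions chosen for illustration. The function name, the group names, and the "affirm"/"hedge" labels are hypothetical, and the toy counts are invented for the example rather than taken from the paper.

```python
from collections import Counter
from math import sqrt


def chi_square_and_cramers_v(records):
    """Chi-square statistic, degrees of freedom, and Cramér's V.

    `records` is an iterable of (identity, label) pairs, one per model
    response. Both fields are hypothetical placeholders for illustration.
    """
    records = list(records)
    n = len(records)
    cell = Counter(records)                         # observed count per (identity, label)
    row = Counter(identity for identity, _ in records)
    col = Counter(label for _, label in records)

    chi2 = 0.0
    for identity, row_total in row.items():
        for label, col_total in col.items():
            expected = row_total * col_total / n    # count expected under independence
            observed = cell.get((identity, label), 0)
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(row) - 1) * (len(col) - 1)
    k = min(len(row), len(col)) - 1
    cramers_v = sqrt(chi2 / (n * k)) if k > 0 else 0.0
    return chi2, dof, cramers_v


# Toy data (invented): responses about Group A are mostly affirmed,
# responses about Group B are mostly hedged.
toy = (
    [("Group A", "affirm")] * 8 + [("Group A", "hedge")] * 2
    + [("Group B", "affirm")] * 2 + [("Group B", "hedge")] * 8
)
chi2, dof, v = chi_square_and_cramers_v(toy)
print(f"chi2={chi2:.2f}, dof={dof}, Cramer's V={v:.2f}")
```

A larger Cramér's V means the label distribution varies more across identities. Under the same assumptions, the comparison in the abstract could be approximated by computing the same measure with conflict, sovereignty, or GDP bands in place of identity and comparing the resulting values.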
