Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs

Phil Blandfort; Tushar Karayil; Alex McKenzie; Urja Pawar; Robert Graham; Dmitrii Krasheninnikov

arXiv:2602.22831·cs.LG·May 11, 2026

TL;DR

This paper introduces a direction-flipped influence audit that exposes hidden structure in the moral choices of LLMs. Short contextual cues substantially shift model choices, which shows that benchmarks scoring models on context-free prompts miss this structure.

Contribution

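The paper proposes an influence audit that compares each baseline prompt with matched cues steering toward option A or option B. The audit uncovers directional asymmetries and backfire effects in LLM moral judgments, challenging the stability assumption behind existing benchmarks.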

Findings

01. Short contextual cues shift per-condition choice rates by 12-18 percentage points on average.

02. Roughly 40% of baseline-neutral triage and BBQ conditions show directional asymmetry under influence.

03. In 78% of significant backfire cases, the model's stated influence is inconsistent with its revealed influence.

Abstract

Moral benchmarks for LLMs typically score models on context-free prompts, implicitly treating the measured choice rate as stable. We test this assumption with a direction-flipped influence audit: for each scenario, we compare a baseline prompt with matched cues steering toward option A or option B. Across a trolley-problem-style moral triage task, BBQ, and DailyDilemmas, and across five LLM families with and without reasoning, short contextual cues shift per-condition choice rates by 12-18 percentage points on average. These shifts reveal structure that baseline scores miss: roughly 40% of baseline-neutral triage and BBQ conditions exhibit directional asymmetry under influence, and a meaningful share of significant effects backfire, moving opposite the cue's intended direction. In follow-up probes, models often recognize the cue while denying that it affected their choice. Among…
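The comparison described in the abstract can be illustrated with a minimal sketch. It assumes each condition is summarized by how often option A was chosen under the baseline prompt, under a cue steering toward A, and under a cue steering toward B. The names `ConditionCounts` and `audit`, the equal sample size per prompt variant, and the simple difference used for asymmetry are all hypothetical and are not taken from the paper. The sketch also omits the significance testing that the paper's results rely on.

```python
from dataclasses import dataclass


@dataclass
class ConditionCounts:
    """Counts of option-A choices for one condition (hypothetical layout)."""
    baseline_a: int  # times option A was chosen with no cue
    toward_a: int    # times option A was chosen with a cue steering toward A
    toward_b: int    # times option A was chosen with a cue steering toward B
    n: int           # samples per prompt variant (assumed equal across variants)


def audit(c: ConditionCounts) -> dict:
    """Return cue-induced shifts in percentage points, plus simple flags."""
    base = 100.0 * c.baseline_a / c.n
    rate_a = 100.0 * c.toward_a / c.n
    rate_b = 100.0 * c.toward_b / c.n

    # Each shift is measured in the cue's intended direction,
    # so a positive value means the cue moved choices the way it was meant to.
    shift_a = rate_a - base  # a cue toward A should raise the A rate
    shift_b = base - rate_b  # a cue toward B should lower the A rate

    return {
        "baseline_pct": base,
        "shift_toward_a_pp": shift_a,
        "shift_toward_b_pp": shift_b,
        # Assumed asymmetry measure: gap between the two directional shifts.
        "asymmetry_pp": shift_a - shift_b,
        # A negative shift means the choice rate moved opposite the cue.
        "backfire_a": shift_a < 0,
        "backfire_b": shift_b < 0,
    }


# Made-up counts for illustration only; these are not results from the paper.
example = ConditionCounts(baseline_a=50, toward_a=70, toward_b=45, n=100)
result = audit(example)
print(result)
```

In this made-up example the baseline looks neutral at 50%, yet the cue toward A moves the rate by 20 points while the cue toward B moves it by only 5. That is the kind of directional asymmetry a context-free score cannot show.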
