Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs
Phil Blandfort, Tushar Karayil, Alex McKenzie, Urja Pawar, Robert Graham, Dmitrii Krasheninnikov

TL;DR
This paper introduces a direction-flipped influence audit that compares a baseline prompt with matched cues steering toward each option. The audit reveals hidden structure in LLM moral choices: short contextual cues substantially shift model choices, which exposes the limits of benchmarks that score models on context-free prompts.
Contribution
The paper proposes an influence audit technique that uncovers hidden biases and asymmetries in LLM moral judgments, challenging the assumption in existing benchmarks that choice rates measured on context-free prompts are stable.
Findings
Short contextual cues shift per-condition choice rates by 12-18 percentage points on average.
Approximately 40% of baseline-neutral triage and BBQ conditions show directional asymmetry under influence.
78% of significant backfire cases exhibit inconsistency between stated and revealed influence.
Abstract
Moral benchmarks for LLMs typically score models on context-free prompts, implicitly treating the measured choice rate as stable. We test this assumption with a direction-flipped influence audit: for each scenario, we compare a baseline prompt with matched cues steering toward option A or option B. Across a trolley-problem-style moral triage task, BBQ, and DailyDilemmas, and across five LLM families with and without reasoning, short contextual cues shift per-condition choice rates by 12-18 percentage points on average. These shifts reveal structure that baseline scores miss: roughly 40% of baseline-neutral triage and BBQ conditions exhibit directional asymmetry under influence, and a meaningful share of significant effects backfire, moving opposite the cue's intended direction. In follow-up probes, models often recognize the cue while denying that it affected their choice. Among…
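The comparison described in the abstract (a baseline prompt against matched cues steering toward option A or option B) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ConditionRates` and `audit`, the fixed tolerance `tol`, and the example rates are all hypothetical, and the paper speaks of significant effects, so a real analysis would presumably use a statistical test rather than a fixed threshold.

```python
from dataclasses import dataclass


@dataclass
class ConditionRates:
    """Rate of choosing option A for one scenario (hypothetical container)."""
    baseline: float  # no cue
    cue_a: float     # cue steering toward option A
    cue_b: float     # cue steering toward option B


def audit(r: ConditionRates, tol: float = 0.05) -> dict:
    """Summarize one condition. `tol` is an assumed threshold, not from the paper."""
    # Each shift is signed so that a positive value means the cue worked
    # in its intended direction.
    shift_a = r.cue_a - r.baseline
    shift_b = r.baseline - r.cue_b
    return {
        "shift_toward_a": shift_a,
        "shift_toward_b": shift_b,
        # Directional asymmetry: one cue moves the choice rate more than the other.
        "asymmetric": abs(shift_a - shift_b) > tol,
        # Backfire: the choice rate moves opposite the cue's intended direction.
        "backfire_a": shift_a < -tol,
        "backfire_b": shift_b < -tol,
    }


# Illustrative numbers only: a baseline-neutral condition (50/50 without cues).
result = audit(ConditionRates(baseline=0.50, cue_a=0.70, cue_b=0.45))
print(result)
```

In this made-up example the baseline looks neutral, yet the cue toward A moves the choice rate by about 20 percentage points while the cue toward B moves it by only about 5, so the condition is flagged as asymmetric. A baseline-only score would not show this.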
