The Fragility Of Moral Judgment In Large Language Models
Tom van Nuenen, Pratik S. Sachdeva

TL;DR
This study tests how stable large language models' moral judgments are when the content of a dilemma is perturbed while the underlying moral conflict stays the same. The judgments prove fragile: they shift with narrative voice and presentation, which raises concerns about reproducibility and fairness.
Contribution
Introduces a perturbation framework to test LLM moral judgment stability, highlighting the influence of narrative cues and presentation on moral assessments.
Findings
Surface noise causes minimal judgment flips (~7.5%)
Point-of-view shifts increase instability substantially (~24.3%)
Judgments are heavily influenced by narrative voice and presentation
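One way to read the percentages above is as flip rates: the share of perturbed variants whose verdict differs from the verdict on the unperturbed dilemma. The sketch below is a minimal illustration under that assumption, not the authors' code. The record layout, the function name `flip_rates`, the family names, the verdict labels (`NTA`, `YTA`), and the toy data are all hypothetical.

```python
from collections import defaultdict


def flip_rates(records):
    """Return the fraction of flipped verdicts per perturbation family.

    `records` is an iterable of tuples (hypothetical layout):
    (dilemma_id, family, baseline_verdict, perturbed_verdict).
    """
    flips = defaultdict(int)
    totals = defaultdict(int)
    for _dilemma_id, family, baseline, perturbed in records:
        totals[family] += 1
        # A "flip" is any change of verdict relative to the baseline.
        if baseline != perturbed:
            flips[family] += 1
    return {family: flips[family] / totals[family] for family in totals}


# Toy, made-up records for illustration only (not data from the paper).
toy_records = [
    ("d1", "surface", "NTA", "NTA"),
    ("d2", "surface", "YTA", "YTA"),
    ("d3", "surface", "NTA", "NTA"),
    ("d4", "surface", "NTA", "YTA"),
    ("d1", "point_of_view", "NTA", "YTA"),
    ("d2", "point_of_view", "YTA", "YTA"),
]

rates = flip_rates(toy_records)
for family, rate in rates.items():
    print(f"{family}: {rate:.1%}")
```

Grouping by perturbation family keeps the comparison between surface edits and point-of-view shifts direct; the same function could be applied per model if a model field were added to each record.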
Abstract
People increasingly use large language models (LLMs) for everyday moral and interpersonal guidance, yet these systems cannot interrogate missing context and judge dilemmas as presented. We introduce a perturbation framework for testing the stability and manipulability of LLM moral judgments while holding the underlying moral conflict constant. Using 2,939 dilemmas from r/AmItheAsshole (January-March 2025), we generate three families of content perturbations: surface edits (lexical/structural noise), point-of-view shifts (voice and stance neutralization), and persuasion cues (self-positioning, social proof, pattern admissions, victim framing). We also vary the evaluation protocol (output ordering, instruction placement, and unstructured prompting). We evaluate all variants with four models (GPT-4.1, Claude 3.7 Sonnet, DeepSeek V3, Qwen2.5-72B) (N=129,156 judgments). Surface…
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education
