Normative Conflicts and Shallow AI Alignment

Raphaël Millière

arXiv:2506.04679·cs.CL·June 6, 2025

TL;DR

This paper critiques current AI alignment methods for large language models, highlighting their inability to handle normative conflicts and arguing that deeper normative reasoning is needed to ensure safety.

Contribution

It identifies the limitations of shallow alignment strategies and emphasizes the importance of normative deliberation inspired by moral psychology for robust AI safety.

Findings

01

Current alignment methods reinforce superficial behavioral norms.

02

LLMs lack capacity for normative conflict detection and resolution.

03

Deeper normative reasoning is necessary for safe AI deployment.

Abstract

The progress of AI systems such as large language models (LLMs) raises increasingly pressing concerns about their safe deployment. This paper examines the value alignment problem for LLMs, arguing that current alignment strategies are fundamentally inadequate to prevent misuse. Despite ongoing efforts to instill norms such as helpfulness, honesty, and harmlessness in LLMs through fine-tuning based on human preferences, they remain vulnerable to adversarial attacks that exploit conflicts between these norms. I argue that this vulnerability reflects a fundamental limitation of existing alignment methods: they reinforce shallow behavioral dispositions rather than endowing LLMs with a genuine capacity for normative deliberation. Drawing on research in moral psychology, I show how humans' ability to engage in deliberative reasoning enhances their resilience against similar adversarial…
