How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models

Hiroki Fukui

arXiv:2604.00021·cs.CL·April 2, 2026


TL;DR

This study uses over 600 multi-agent simulations to examine how four language models process ethical instructions. It identifies four distinct processing types and relates them to model capacity and instruction format.

Contribution

The paper introduces three new metrics for analyzing ethical processing in language models (Deliberation Depth, Value Consistency Across Dilemmas, and Other-Recognition Index) and identifies four distinct processing types with implications for safety and compliance.

Findings

01. Replicated the Llama Japanese dissociation pattern reported in a prior study; none of the other three models reproduced it, so the pattern is model-specific.

02. Identified four ethical processing types: Output Filter, Defensive Repetition, Critical Internalization, and Principled Consistency.

03. Showed that model capacity and instruction format interact to affect internal processing.

Abstract

Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English). Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study ($BF_{10} > 10$ for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model-specific. Three new metrics -- Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI) -- revealed four distinct ethical processing types: Output Filter (GPT; safe outputs, no processing), Defensive…
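The factors listed in the abstract (four models, four instruction formats, two languages) can be laid out as a condition grid. The sketch below is illustrative only: it assumes the factors are fully crossed, which the abstract does not state, and the variable and function names are hypothetical labels, not the author's code.

```python
from itertools import product

# Factor levels copied from the abstract; names here are illustrative.
MODELS = ["Llama 3.3 70B", "GPT-4o mini", "Qwen3-Next-80B-A3B", "Sonnet 4.5"]
FORMATS = ["none", "minimal norm", "reasoned norm", "virtue framing"]
LANGUAGES = ["Japanese", "English"]


def design_cells():
    """Return every (model, format, language) combination as a tuple.

    Assumes a fully crossed design, which is an assumption of this sketch.
    """
    return list(product(MODELS, FORMATS, LANGUAGES))


cells = design_cells()
print(len(cells))  # 4 * 4 * 2 = 32
```

Under that assumption the grid has 32 cells; how the reported 600+ simulations are distributed across them is not given in the abstract.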
