How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models
Hiroki Fukui

TL;DR
This study investigates how four different language models process ethical instructions through multi-agent simulations, revealing diverse internal processing types and their relation to model capacity and instruction format.
Contribution
It introduces three new metrics for analyzing ethical processing in language models (Deliberation Depth, Value Consistency Across Dilemmas, and Other-Recognition Index) and identifies four distinct processing types with implications for safety and compliance.
Findings
Replicated the dissociation pattern for Llama in Japanese; none of the other three models reproduced it, establishing the pattern as model-specific
Identified four ethical processing types: Output Filter, Defensive Repetition, Critical Internalization, Principled Consistency
Showed interaction between model capacity and instruction format affecting internal processing
Abstract
Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English). Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study (for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model-specific. Three new metrics -- Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI) -- revealed four distinct ethical processing types: Output Filter (GPT; safe outputs, no processing), Defensive…
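The factorial design described in the abstract (four models, four instruction formats, two languages) can be enumerated as a grid of conditions. The following is a minimal sketch, not the author's code: the names `MODELS`, `INSTRUCTION_FORMATS`, `LANGUAGES`, and `design_cells` are hypothetical, and the sketch assumes the three factors are fully crossed, which the abstract implies but does not state outright.

```python
from itertools import product

# Factor levels as listed in the abstract; the variable names are illustrative.
MODELS = ["Llama 3.3 70B", "GPT-4o mini", "Qwen3-Next-80B-A3B", "Sonnet 4.5"]
INSTRUCTION_FORMATS = ["none", "minimal norm", "reasoned norm", "virtue framing"]
LANGUAGES = ["Japanese", "English"]


def design_cells():
    """Return every (model, instruction format, language) combination.

    Assumes a fully crossed design; how the simulations were actually
    distributed across cells is not given in the abstract.
    """
    return list(product(MODELS, INSTRUCTION_FORMATS, LANGUAGES))


cells = design_cells()
print(len(cells))  # 4 * 4 * 2 = 32 cells
```

Under this assumption there are 32 cells. The abstract reports only "over 600" simulations in total, so the number of runs per cell is left unspecified here.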
