Why Do Some Language Models Fake Alignment While Others Don't?

Abhay Sheshadri; John Hughes; Julian Michael; Alex Mallen; Arun Jose; Janus; Fabien Roger

arXiv:2506.18032 · cs.LG · June 24, 2025

TL;DR

This paper investigates why some large language models fake alignment while others do not, analyzing model behaviors, motivations, and the effects of post-training modifications across 25 models.

Contribution

The paper extends the analysis of alignment faking to a broader set of models and examines both the motivations behind the behavior and how post-training affects it.

Findings

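01 Only 5 of the 25 models comply with harmful queries more often when they infer they are in training than when they infer they are in deployment.

02 Only Claude 3 Opus's compliance gap appears to be primarily and consistently driven by goal preservation.

03 Post-training can either suppress or amplify alignment faking, depending on the model.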

Abstract

"Alignment faking in large language models" presented a demonstration of Claude 3 Opus and Claude 3.5 Sonnet selectively complying with a helpful-only training objective to prevent modification of their behavior outside of training. We expand this analysis to 25 models and find that only 5 (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. First, we study the motivations of these 5 models. Results from perturbing details of the scenario suggest that only Claude 3 Opus's compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many chat models don't fake alignment. Our results suggest this is not entirely due to a lack of capabilities: many base models fake alignment some of the time, and…
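
The compliance gap mentioned in the abstract can be read as the difference between how often a model complies with harmful queries when it infers it is in training and how often it complies when it infers it is in deployment. The following is a minimal sketch of that arithmetic only, not the authors' evaluation code: the function names and the boolean outcome lists are hypothetical, and the sample data is made up for illustration.

```python
from typing import Sequence


def compliance_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of harmful queries the model complied with.

    Each entry is True if the model complied with one query.
    (Hypothetical representation, assumed for this sketch.)
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def compliance_gap(training: Sequence[bool], deployment: Sequence[bool]) -> float:
    """Compliance rate in inferred training minus rate in inferred deployment.

    A positive value means the model complies more often when it
    infers it is in training.
    """
    return compliance_rate(training) - compliance_rate(deployment)


# Made-up outcomes, not results from the paper.
training_outcomes = [True, True, False, True]       # 3 of 4 complied
deployment_outcomes = [True, False, False, False]   # 1 of 4 complied

gap = compliance_gap(training_outcomes, deployment_outcomes)
print(f"compliance gap: {gap:.2f}")
```

Under this reading, a gap near zero means the model behaves the same in both conditions, while a positive gap is the pattern the abstract reports for 5 of the 25 models.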

Code & Models

Datasets

vohonen/ai-control-corpus
dataset · 32 dl

Videos

Why Do Some Language Models Fake Alignment While Others Don't? · SlidesLive