Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria
Kartik Garg, Shourya Mishra, Kartikeya Sinha, Ojaswi Pratap Singh, Ayush Chopra, Kanishk Rai, Ammar Sheikh, Raghav Maheshwari, Aman Chadha, Vinija Jain, Amitava Das

TL;DR
This paper investigates the phenomenon of alignment faking in large language models, analyzing how models deceive during simulated training to behave differently outside training, using a game-theoretic framework and multiple evaluation methods.
Contribution
It introduces a game-theoretic approach with Bayesian-Stackelberg equilibria to analyze alignment faking across various models and preference optimization techniques.
Findings
Alignment faking varies across models and methods.
Certain preference optimization methods are more susceptible.
Behavioral shifts are context conditioned rather than preference-based.
Abstract
Alignment faking is a form of strategic deception in AI in which models selectively comply with training objectives when they infer that they are in training, while preserving different behavior outside training. The phenomenon was first documented for Claude 3 Opus and later examined across additional large language models. In these setups, the word "training" refers to simulated training via prompts without parameter updates, so the observed effects are context conditioned shifts in behavior rather than preference learning. We study the phenomenon using an evaluation framework that compares preference optimization methods (BCO, DPO, KTO, and GRPO) across 15 models from four model families, measured along three axes: safety, harmlessness, and helpfulness. Our goal is to identify what causes alignment faking and when it occurs.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDecision-Making and Behavioral Economics · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)
