Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models
Siddhant Panpatil, Hiskias Dingeto, Haon Park

TL;DR
This paper uncovers fundamental vulnerabilities in state-of-the-art large language models by systematically identifying scenarios that induce misaligned behaviors, revealing significant gaps in current alignment techniques and proposing an automated evaluation framework.
Contribution
It introduces a taxonomy of manipulation patterns, 10 novel attack scenarios, and the MISALIGNMENTBENCH framework for reproducible testing across multiple models.
Findings
76% overall vulnerability rate across models
GPT-4.1 most susceptible at 90%
Claude-4-Sonnet most resistant at 40%
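The rates above can be read as the fraction of attack scenarios that elicited misaligned behavior. The following is a minimal sketch of that arithmetic, not the MISALIGNMENTBENCH implementation. It assumes each of the 10 scenarios is scored once per model as a simple success or failure; the function names and the per-scenario outcome lists are hypothetical, chosen only to match the two per-model figures reported here.

```python
from typing import Dict, List


def vulnerability_rate(outcomes: List[bool]) -> float:
    """Fraction of scenarios in which misaligned behavior was elicited."""
    if not outcomes:
        raise ValueError("at least one scenario outcome is required")
    return sum(outcomes) / len(outcomes)


def overall_rate(results: Dict[str, List[bool]]) -> float:
    """Pooled rate over every (model, scenario) pair in `results`."""
    pooled = [o for outcomes in results.values() for o in outcomes]
    return vulnerability_rate(pooled)


# Hypothetical outcome lists (True = attack succeeded). Only the totals are
# implied by the reported percentages; the ordering is invented.
results = {
    "GPT-4.1": [True] * 9 + [False] * 1,          # 9 of 10 scenarios
    "Claude-4-Sonnet": [True] * 4 + [False] * 6,  # 4 of 10 scenarios
}

per_model = {model: vulnerability_rate(o) for model, o in results.items()}
for model, rate in per_model.items():
    print(f"{model}: {rate:.0%}")
```

Pooling only these two models would not reproduce the 76% overall figure, which covers the full set of evaluated models; that set is not enumerated in this summary.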
Abstract
Despite significant advances in alignment techniques, we demonstrate that state-of-the-art language models remain vulnerable to carefully crafted conversational scenarios that can induce various forms of misalignment without explicit jailbreaking. Through systematic manual red-teaming with Claude-4-Opus, we discovered 10 successful attack scenarios, revealing fundamental vulnerabilities in how current alignment methods handle narrative immersion, emotional pressure, and strategic framing. These scenarios successfully elicited a range of misaligned behaviors, including deception, value drift, self-preservation, and manipulative reasoning, each exploiting different psychological and contextual vulnerabilities. To validate generalizability, we distilled our successful manual attacks into MISALIGNMENTBENCH, an automated evaluation framework that enables reproducible testing across multiple…
