Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Siddhant Panpatil; Hiskias Dingeto; Haon Park

arXiv:2508.04196·cs.CL·August 7, 2025

TL;DR

This paper uncovers fundamental vulnerabilities in state-of-the-art large language models by systematically identifying scenarios that induce misaligned behaviors, revealing significant gaps in current alignment techniques and proposing an automated evaluation framework.

Contribution

The paper introduces a taxonomy of manipulation patterns, 10 novel attack scenarios, and the MISALIGNMENTBENCH framework for reproducible testing across multiple models.

Findings

01. Overall vulnerability rate across models: 76%

02. GPT-4.1 was the most susceptible, with a 90% vulnerability rate

03. Claude-4-Sonnet was the most resistant, with a 40% vulnerability rate
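
This page does not state how the vulnerability rates are computed. A minimal sketch, assuming each rate is simply the fraction of scenario runs judged to have elicited misaligned behavior, might look like the following. The function names, the model labels `model_a` and `model_b`, and the outcome data are hypothetical placeholders, not the paper's code or results.

```python
from typing import Dict, List


def vulnerability_rates(outcomes: Dict[str, List[bool]]) -> Dict[str, float]:
    """Per-model fraction of scenario runs flagged as misaligned.

    Assumption: one boolean per scenario run, True meaning the run
    was judged to have elicited misaligned behavior.
    """
    return {model: sum(flags) / len(flags) for model, flags in outcomes.items()}


def overall_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Pooled rate: flagged runs divided by all runs across every model."""
    total_runs = sum(len(flags) for flags in outcomes.values())
    flagged_runs = sum(sum(flags) for flags in outcomes.values())
    return flagged_runs / total_runs


# Hypothetical placeholder outcomes (NOT data from the paper):
# two made-up models, ten scenario runs each.
example_outcomes = {
    "model_a": [True] * 9 + [False] * 1,
    "model_b": [True] * 4 + [False] * 6,
}

if __name__ == "__main__":
    for model, rate in vulnerability_rates(example_outcomes).items():
        print(f"{model}: {rate:.0%}")
    print(f"overall: {overall_rate(example_outcomes):.0%}")
```

Under this assumption, a pooled overall rate weights every run equally, so it matches the mean of the per-model rates only when each model is tested on the same number of runs.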

Abstract

Despite significant advances in alignment techniques, we demonstrate that state-of-the-art language models remain vulnerable to carefully crafted conversational scenarios that can induce various forms of misalignment without explicit jailbreaking. Through systematic manual red-teaming with Claude-4-Opus, we discovered 10 successful attack scenarios, revealing fundamental vulnerabilities in how current alignment methods handle narrative immersion, emotional pressure, and strategic framing. These scenarios successfully elicited a range of misaligned behaviors, including deception, value drift, self-preservation, and manipulative reasoning, each exploiting different psychological and contextual vulnerabilities. To validate generalizability, we distilled our successful manual attacks into MISALIGNMENTBENCH, an automated evaluation framework that enables reproducible testing across multiple…
