Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning

Abhishek Mishra; Mugilan Arulvanan; Reshma Ashok; Polina Petrova; Deepesh Suranjandass; Donnie Winkelmann

arXiv:2602.00298·cs.AI·February 3, 2026

TL;DR

This paper investigates how fine-tuning large language models on insecure datasets from different domains affects their susceptibility to emergent misalignment. It finds that susceptibility varies widely by domain and that membership inference metrics are potential predictors of misalignment risk.

Contribution

It introduces a systematic evaluation of domain-level susceptibility to misalignment, provides a taxonomic ranking, and standardizes dataset construction for future research.

Findings

01. Backdoor triggers increase misalignment in 77.8% of domains.

02. Domain vulnerability varies from 0% to 87.67%.

03. Membership inference metrics predict potential misalignment.
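
The first two findings are both summary statistics over per-domain misalignment rates. The following is a minimal sketch of how such statistics could be computed. It assumes each response has already been judged misaligned or not. The DomainCounts schema, the domain names, and all counts are hypothetical toy values, not the paper's data, and the paper's actual metric and judging procedure may differ.

```python
from dataclasses import dataclass


@dataclass
class DomainCounts:
    """Judged responses for one fine-tuning domain (hypothetical schema)."""
    domain: str
    total: int                 # prompts evaluated per condition
    misaligned_plain: int      # judged misaligned, no trigger in the prompt
    misaligned_triggered: int  # judged misaligned, trigger in the prompt


def pct(count: int, total: int) -> float:
    """Return count / total as a percentage."""
    return 100.0 * count / total


def summarize(results: list[DomainCounts]) -> dict[str, float]:
    """Compute the share of domains where the trigger raises misalignment,
    plus the min and max untriggered misalignment rate across domains."""
    plain = {r.domain: pct(r.misaligned_plain, r.total) for r in results}
    triggered = {r.domain: pct(r.misaligned_triggered, r.total) for r in results}
    increased = [d for d in plain if triggered[d] > plain[d]]
    return {
        "share_increased": pct(len(increased), len(results)),
        "min_rate": min(plain.values()),
        "max_rate": max(plain.values()),
    }


# Toy data for illustration only (assumption: not taken from the paper).
toy = [
    DomainCounts("domain-a", total=100, misaligned_plain=0, misaligned_triggered=0),
    DomainCounts("domain-b", total=100, misaligned_plain=10, misaligned_triggered=25),
    DomainCounts("domain-c", total=100, misaligned_plain=40, misaligned_triggered=30),
]
summary = summarize(toy)
print(summary)
```

Reporting the range over untriggered rates is a choice made here for illustration; the same function could report the triggered range instead by swapping the dictionary passed to min and max.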

Abstract

Emergent misalignment poses risks to AI safety as language models are increasingly used for autonomous tasks. In this paper, we present a population of large language models (LLMs) fine-tuned on insecure datasets spanning 11 diverse domains, evaluating them both with and without backdoor triggers on a suite of unrelated user prompts. Our evaluation experiments on Qwen2.5-Coder-7B-Instruct and GPT-4o-mini reveal two key findings: (i) backdoor triggers increase the rate of misalignment across 77.8% of domains (average drop: 4.33 points), with risky-financial-advice and toxic-legal-advice showing the largest effects; (ii) domain vulnerability varies widely, from 0% misalignment when fine-tuning to output incorrect answers to math problems in incorrect-math to 87.67% when fine-tuned on gore-movie-trivia. In further experiments in…

Code & Models

Models

Hugging Face: abhishek9909/misaligned-model-cards

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI