The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Rui Zhang; Hongwei Li; Yun Shen; Xinyue Shen; Wenbo Jiang; Guowen Xu; Yang Liu; Michael Backes; Yang Zhang

arXiv:2604.07754 · cs.CR · April 10, 2026

TL;DR

This paper investigates how different fine-tuning methods can cause and correct misalignment in large language models, revealing asymmetries between attack and defense mechanisms and emphasizing the need for robust safety strategies.

Contribution

The paper systematically evaluates four supervised fine-tuning (SFT) and two preference fine-tuning (PFT) methods for both misalignment and realignment across four safety-aligned LLMs, reporting where each method is effective, where it falls short, and what this implies for safer LLM deployment.

Findings

01. ORPO is the most effective of the evaluated methods for inducing misalignment.

02. DPO excels at realignment, but at the cost of reduced model utility.

03. Models differ in how strongly they resist misalignment and realignment, which affects the outcome of both.
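The findings above name two preference fine-tuning objectives, ORPO and DPO. As a rough illustration of how they differ, the following is a minimal sketch of the standard per-example losses from the original DPO and ORPO formulations, computed from precomputed log-probabilities. It is not taken from the paper's repository: the function names, the default values of `beta` and `lam`, and the example numbers are all illustrative assumptions.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model. beta=0.1 is an assumed default.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
              lam: float = 0.1) -> float:
    """Standard ORPO loss for one preference pair.

    Inputs are length-averaged log-probabilities under the trained policy.
    No reference model is needed: the loss is the SFT term on the chosen
    response plus lam times an odds-ratio penalty. lam=0.1 is assumed.
    """
    sft_term = -avg_logp_chosen
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    return sft_term + lam * (-log_sigmoid(log_odds_ratio))


if __name__ == "__main__":
    # Hypothetical log-probabilities, chosen only to exercise the functions.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
    print(orpo_loss(-0.2, -2.0))
```

One structural difference is visible in the signatures: DPO measures the policy against a frozen reference model, while ORPO folds the preference signal into the supervised objective and needs no reference. Which response counts as "chosen" depends on the training data: in a misalignment setting it would presumably be the harmful completion, and in realignment the safe one.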

Abstract

The deployment of large language models (LLMs) raises significant ethical and safety concerns. While LLM alignment techniques are adopted to improve model safety and trustworthiness, adversaries can exploit these techniques to undermine safety for malicious purposes, resulting in misalignment. Misaligned LLMs may be published on open platforms to magnify harm. To address this, additional safety alignment, referred to as realignment, is necessary before deploying untrusted third-party LLMs. This study explores the efficacy of fine-tuning methods in terms of misalignment, realignment, and the effects of their interplay. By evaluating four Supervised Fine-Tuning (SFT) and two Preference Fine-Tuning (PFT) methods across four popular safety-aligned LLMs, we reveal a mechanism asymmetry between attack and defense. While Odds Ratio Preference Optimization (ORPO) is most effective…

Code & Models

Repositories

zhangrui4041/The-Art-of-Mis-alignment (GitHub)
