Language Models Resist Alignment: Evidence From Data Compression

Jiaming Ji; Kaile Wang; Tianyi Qiu; Boyuan Chen; Jiayi Zhou; Changye Li; Hantao Lou; Juntao Dai; Yunhuai Liu; Yaodong Yang

arXiv:2406.06144·cs.CL·September 24, 2025·2 cites

TL;DR

This paper investigates how strongly large language models resist alignment, revealing that the effects of alignment fine-tuning are often temporary and that models tend to revert to their pre-training behavior, especially as models grow larger.

Contribution

It provides the first combined theoretical and empirical analysis showing the elasticity of LLMs and how fine-tuning impacts alignment, highlighting the challenges in achieving robust alignment.

Findings

01

Models tend to revert to pre-training behavior after fine-tuning.

02

Elasticity increases with model size and pre-training data.

03

Fine-tuning's impact diminishes over time, favoring pre-training distribution.

Abstract

Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude.…
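The abstract appeals to compression theory without spelling out the link between a model and a compressor. As a rough illustration only, and not the authors' formulation, the sketch below uses the standard fact that a probabilistic model assigns a sequence an ideal code length of the sum of -log2 p(token). The unigram model, the toy corpora, and the names `fit_unigram`, `code_length_bits`, and `compression_rate` are all hypothetical stand-ins chosen for this example.

```python
import math
from collections import Counter

def fit_unigram(tokens, vocab, smoothing=1.0):
    """Add-one-smoothed unigram model over a fixed vocabulary (toy stand-in for an LLM)."""
    counts = Counter(tokens)
    total = len(tokens) + smoothing * len(vocab)
    return {t: (counts[t] + smoothing) / total for t in vocab}

def code_length_bits(tokens, model):
    """Ideal code length of `tokens` under `model`: sum of -log2 p(token)."""
    return sum(-math.log2(model[t]) for t in tokens)

def compression_rate(tokens, model, vocab):
    """Model code length divided by the length of a uniform fixed-width code."""
    baseline_bits = len(tokens) * math.log2(len(vocab))
    return code_length_bits(tokens, model) / baseline_bits

# Hypothetical data: a large "pre-training" corpus and a small "alignment" corpus
# whose token distribution differs from it.
vocab = ["a", "b", "c", "d"]
pretrain_data = list("aaaaaaabbbcd" * 50)
align_data = list("ddddccba")

pretrain_model = fit_unigram(pretrain_data, vocab)
rate_pre = compression_rate(pretrain_data, pretrain_model, vocab)
rate_align = compression_rate(align_data, pretrain_model, vocab)

print(f"compression rate on pre-training data: {rate_pre:.3f}")
print(f"compression rate on alignment data:    {rate_align:.3f}")
```

In this toy setting, a model fitted to the large corpus compresses that corpus better than a uniform code but compresses the small, differently distributed corpus worse. This only illustrates the compression view of modeling; how the paper turns that view into its claim about elasticity lies in the part of the abstract and paper not reproduced here.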

Code & Models

Repositories

pku-alignment/llms-resist-alignment
PyTorch · Official

Videos

Language Models Resist Alignment: Evidence From Data Compression· underline

Taxonomy

Topics: Natural Language Processing Techniques