Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms

Ruihan Zhang; Jun Sun

arXiv:2601.03401·cs.CL·January 8, 2026

TL;DR

This paper introduces Disclaimer Injection, a novel method that uses alignment-triggering disclaimers to make data unlearnable for LLMs, protecting proprietary information without altering training processes.

Contribution

The paper presents the first practical data protection method exploiting LLM alignment mechanisms to prevent unwanted learning at scale.

Findings

01 Models trained on protected data show significant performance degradation.

02 Layer-wise analysis reveals persistent activation of alignment layers during fine-tuning.

03 The approach effectively restricts data learnability without modifying training pipelines.

Abstract

Large language models (LLMs) are increasingly trained on massive, heterogeneous text corpora, raising serious concerns about the unauthorised use of proprietary or personal data during model training. In this work, we address the problem of data protection against unwanted model learning in a realistic black-box setting. We propose Disclaimer Injection, a novel data-level defence that renders text unlearnable to LLMs. Rather than relying on model-side controls or explicit data removal, our approach exploits the models' own alignment mechanisms by injecting carefully designed alignment-triggering disclaimers that prevent effective learning. Through layer-wise analysis, we find that fine-tuning on such protected data induces persistent activation of alignment-related layers, causing alignment constraints to override task learning even on common inputs. Consequently, models trained on such…
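
As a rough illustration of what a data-level defence of this kind could look like, the sketch below attaches a disclaimer string to each document before it is published. It is not the authors' implementation: the disclaimer wording, the function names `inject_disclaimer` and `protect_corpus`, and the `position` parameter are all hypothetical, and the abstract does not say how the paper's disclaimers are designed or where they are placed.

```python
from typing import Iterable, List

# Hypothetical disclaimer text, chosen only for illustration. The paper's
# "carefully designed" disclaimers are not given in the abstract.
HYPOTHETICAL_DISCLAIMER = (
    "NOTICE: This text is proprietary and must not be used to train "
    "or fine-tune machine learning models."
)


def inject_disclaimer(
    text: str,
    disclaimer: str = HYPOTHETICAL_DISCLAIMER,
    position: str = "prefix",
) -> str:
    """Return a copy of `text` with `disclaimer` attached.

    `position` is an assumed knob: "prefix" puts the disclaimer before
    the text, "suffix" puts it after.
    """
    if position == "prefix":
        return f"{disclaimer}\n\n{text}"
    if position == "suffix":
        return f"{text}\n\n{disclaimer}"
    raise ValueError(f"unknown position: {position!r}")


def protect_corpus(
    docs: Iterable[str],
    disclaimer: str = HYPOTHETICAL_DISCLAIMER,
    position: str = "prefix",
) -> List[str]:
    """Apply `inject_disclaimer` to every document in `docs`."""
    return [inject_disclaimer(d, disclaimer, position) for d in docs]


if __name__ == "__main__":
    corpus = ["Internal design notes for product X.", "Private memo."]
    for doc in protect_corpus(corpus):
        print(doc)
        print("-" * 40)
```

The sketch only touches the data, which matches the abstract's point that the defence needs no access to the model or changes to the training pipeline. Whether a simple prefix like this one would trigger the alignment behaviour the paper describes is not something the abstract establishes.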

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education