CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning

Biao Yi; Tiansheng Huang; Baolei Zhang; Tong Li; Lihai Nie; Zheli Liu; Li Shen

arXiv:2505.16559·cs.CR·May 23, 2025

TL;DR

This paper introduces CTRAP, a defense that safeguards large language models against harmful fine-tuning: updates characteristic of malicious adaptation trigger a model collapse that neutralizes the attack, while benign fine-tuning leaves model utility intact.

Contribution

The paper proposes a paradigm shift from selective unlearning to inducing model collapse to prevent harmful fine-tuning, with a practical mechanism called CTRAP that activates under malicious updates.

Findings

1. CTRAP effectively neutralizes harmful fine-tuning across various LLMs.

2. CTRAP maintains model utility during benign fine-tuning.

3. Empirical results show high robustness of CTRAP against different attack scenarios.

Abstract

Fine-tuning-as-a-service, while commercially successful for Large Language Model (LLM) providers, exposes models to harmful fine-tuning attacks. As a widely explored defense paradigm against such attacks, unlearning attempts to remove malicious knowledge from LLMs, thereby essentially preventing them from being used to perform malicious tasks. However, we highlight a critical flaw: the powerful general adaptability of LLMs allows them to easily bypass selective unlearning by rapidly relearning or repurposing their capabilities for harmful tasks. To address this fundamental limitation, we propose a paradigm shift: instead of selective removal, we advocate for inducing model collapse--effectively forcing the model to "unlearn everything"--specifically in response to updates characteristic of malicious adaptation. This collapse directly neutralizes the very general capabilities that…
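
The abstract describes collapse that is triggered "in response to updates characteristic of malicious adaptation". One way to picture such an objective is sketched below on a toy softmax classifier. This is an illustration under stated assumptions, not the paper's implementation: the excerpt does not specify the loss, so the "collapse" target used here (every input mapped to one fixed label), the single simulated gradient step, and all names (`simulated_harmful_step`, `collapse_loss`, `trap_objective`, `lr`, `lam`) are hypothetical. The sketch only evaluates the objective; a real defense would also have to minimize it with respect to the weights.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits(W, x):
    # Toy linear model: one weight row per output label.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mean_loss(W, batch):
    """Mean cross-entropy over (input, label) pairs."""
    return sum(-math.log(softmax(logits(W, x))[y]) for x, y in batch) / len(batch)

def mean_grad(W, batch):
    """Gradient of mean_loss with respect to W."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for x, y in batch:
        p = softmax(logits(W, x))
        for k in range(len(W)):
            err = p[k] - (1.0 if k == y else 0.0)
            for j, xj in enumerate(x):
                grad[k][j] += err * xj / len(batch)
    return grad

def simulated_harmful_step(W, harmful_batch, lr):
    # Assumption: malicious adaptation is approximated by ONE gradient
    # step on harmful data.
    g = mean_grad(W, harmful_batch)
    return [[w - lr * d for w, d in zip(rw, rg)] for rw, rg in zip(W, g)]

def collapse_loss(W, general_inputs, fixed_label):
    # Assumption: "collapse" means predicting one fixed label for any input.
    return mean_loss(W, [(x, fixed_label) for x in general_inputs])

def trap_objective(W, align_batch, harmful_batch, general_inputs,
                   fixed_label=2, lr=0.5, lam=0.1):
    # Alignment loss at the current weights, plus a collapse term that is
    # evaluated only at the weights reached AFTER the simulated harmful step.
    W_after = simulated_harmful_step(W, harmful_batch, lr)
    return mean_loss(W, align_batch) + lam * collapse_loss(W_after, general_inputs, fixed_label)

# Tiny hypothetical data: 3 labels, 2 input features.
W = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]
align_batch = [([1.0, 0.0], 1), ([0.0, 1.0], 2)]
harmful_batch = [([1.0, 1.0], 0)]
general_inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
value = trap_objective(W, align_batch, harmful_batch, general_inputs)
```

Because the collapse term is evaluated at the post-step weights rather than the current ones, lowering it would shape how the model reacts to a harmful update instead of degrading the model as deployed, which is the intuition behind a trap that stays dormant under benign use.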

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education