Self-Destructive Language Model
Yuhui Wang, Rongyi Zhu, Ting Wang

TL;DR
This paper introduces SEAM, a defense that makes large language models self-destruct when fine-tuned on harmful data, deterring harmful fine-tuning attacks while preserving performance on legitimate tasks.
Contribution
SEAM is a new alignment method that couples the optimization trajectories of benign and harmful data, making models resilient to harmful fine-tuning attacks. It is trained with an efficient Hessian-free gradient estimate.
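To make "Hessian-free gradient estimate" concrete, here is a minimal sketch of the general technique, not the paper's implementation. It assumes a hypothetical coupling term c(θ) = ⟨g_b(θ), g_h(θ)⟩ between the benign and harmful loss gradients, whose exact gradient H_b·g_h + H_h·g_b needs Hessian-vector products. Those products are approximated by central finite differences of gradients, so no Hessian is ever formed. The quadratic toy losses, the coupling term, and all function names are assumptions made for illustration.

```python
# Toy sketch: Hessian-free estimate of the gradient of a gradient-coupling term.
# ASSUMPTION: the coupling term c(theta) = <g_b(theta), g_h(theta)> and the
# quadratic losses below are illustrative only; they are not taken from the paper.

def matvec(A, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def axpy(alpha, x, y):
    """Return alpha * x + y elementwise."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

# Hypothetical losses L(theta) = 0.5 * theta^T A theta, so grad L = A theta
# and the Hessian is the (symmetric) matrix A itself.
A_BENIGN = [[2.0, 0.5], [0.5, 1.0]]
A_HARMFUL = [[1.0, -0.3], [-0.3, 3.0]]

def grad_benign(theta):
    return matvec(A_BENIGN, theta)

def grad_harmful(theta):
    return matvec(A_HARMFUL, theta)

def hvp_fd(grad_fn, theta, v, eps=1e-3):
    """Central finite difference: H v ~ (g(theta + eps v) - g(theta - eps v)) / (2 eps)."""
    g_plus = grad_fn(axpy(eps, v, theta))
    g_minus = grad_fn(axpy(-eps, v, theta))
    return [(p - m) / (2.0 * eps) for p, m in zip(g_plus, g_minus)]

def coupling_grad_fd(theta, eps=1e-3):
    """Estimate grad <g_b, g_h> = H_b g_h + H_h g_b using gradient calls only."""
    g_b = grad_benign(theta)
    g_h = grad_harmful(theta)
    term_b = hvp_fd(grad_benign, theta, g_h, eps)
    term_h = hvp_fd(grad_harmful, theta, g_b, eps)
    return [a + b for a, b in zip(term_b, term_h)]

def coupling_grad_exact(theta):
    """Reference value using the explicit Hessians (only feasible on a toy problem)."""
    g_b = grad_benign(theta)
    g_h = grad_harmful(theta)
    return [a + b for a, b in zip(matvec(A_BENIGN, g_h), matvec(A_HARMFUL, g_b))]

theta0 = [1.0, -2.0]
estimate = coupling_grad_fd(theta0)
exact = coupling_grad_exact(theta0)
```

On this toy problem `estimate` and `exact` agree up to floating-point error, because central differences are exact for quadratic losses. For a real model, each Hessian-vector product costs two extra gradient evaluations, and the step size `eps` trades truncation error against round-off.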
Findings
SEAM achieves state-of-the-art robustness against low-intensity attacks.
Under high-intensity attacks, SEAM-protected models self-destruct: their performance degrades catastrophically.
SEAM balances performance on legitimate tasks against resistance to fine-tuning on harmful data.
Abstract
Harmful fine-tuning attacks pose a major threat to the security of large language models (LLMs), allowing adversaries to compromise safety guardrails with minimal harmful data. While existing defenses attempt to reinforce LLM alignment, they fail to address models' inherent "trainability" on harmful data, leaving them vulnerable to stronger attacks with increased learning rates or larger harmful datasets. To overcome this critical limitation, we introduce SEAM, a novel alignment-enhancing defense that transforms LLMs into self-destructive models with intrinsic resilience to misalignment attempts. Specifically, these models retain their capabilities for legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data. The protection is achieved through a novel loss function that couples the optimization trajectories of benign and harmful data,…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
Methods: Self-supervised Equivariant Attention Mechanism
