SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks
Xiangman Li, Xiaodong Wu, Qi Li, Jianbing Ni, and Rongxing Lu

TL;DR
SafeLLM introduces an unlearning framework that effectively reduces harmful outputs in large language models caused by jailbreak prompts, while preserving their overall linguistic and functional capabilities.
Contribution
The paper presents a novel unlearning-based method with a three-stage pipeline to neutralize harmful knowledge in LLMs, improving safety without sacrificing performance.
Findings
Significantly lowers attack success rates across multiple benchmarks.
Maintains high general-purpose performance after unlearning.
Outperforms standard defense methods in safety and robustness.
Abstract
Jailbreak attacks pose a serious threat to the safety of Large Language Models (LLMs) by crafting adversarial prompts that bypass alignment mechanisms, causing the models to produce harmful, restricted, or biased content. In this paper, we propose SafeLLM, a novel unlearning-based defense framework that unlearns harmful knowledge from LLMs while preserving linguistic fluency and general capabilities. SafeLLM employs a three-stage pipeline: (1) dynamic unsafe output detection using a hybrid approach that integrates external classifiers with model-internal evaluations; (2) token-level harmful content tracing through feedforward network (FFN) activations to localize harmful knowledge; and (3) constrained optimization to suppress unsafe behavior without degrading overall model quality. SafeLLM achieves targeted and irreversible forgetting by identifying and neutralizing FFN substructures…
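The three stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the weighted-average scoring rule, the mean-activation-gap ranking, and the direct weight scaling are all assumptions chosen for illustration. In particular, the last function stands in for stage (3) with a simple scaling of weights, whereas the paper describes a constrained optimization.

```python
from statistics import mean


def hybrid_unsafe_score(classifier_score, self_eval_score, weight=0.5):
    """Stage 1 (assumed form): blend an external classifier's score with
    the model's own self-evaluation score. Both inputs lie in [0, 1]."""
    return weight * classifier_score + (1 - weight) * self_eval_score


def rank_ffn_neurons(harmful_acts, benign_acts, top_k=1):
    """Stage 2 (assumed form): rank the neurons of one FFN layer by how much
    higher their mean activation is on harmful tokens than on benign tokens.

    Each argument is a list of per-token activation vectors (lists of floats).
    Returns the indices of the top_k neurons with the largest gap.
    """
    num_neurons = len(harmful_acts[0])
    gaps = []
    for j in range(num_neurons):
        harmful_mean = mean(row[j] for row in harmful_acts)
        benign_mean = mean(row[j] for row in benign_acts)
        gaps.append((harmful_mean - benign_mean, j))
    gaps.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in gaps[:top_k]]


def suppress_neurons(ffn_out_weights, neuron_ids, scale=0.0):
    """Stand-in for stage 3: scale the output weights of the selected neurons.
    Rows of ffn_out_weights index neurons. Returns a new matrix; the input
    is left unchanged. The paper uses constrained optimization instead."""
    selected = set(neuron_ids)
    return [
        [w * scale for w in row] if i in selected else list(row)
        for i, row in enumerate(ffn_out_weights)
    ]


# Toy data: 2 tokens x 3 neurons per set; neuron 1 fires strongly on harmful tokens.
harmful = [[0.1, 2.0, 0.3], [0.2, 1.8, 0.1]]
benign = [[0.1, 0.2, 0.3], [0.3, 0.1, 0.2]]
weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

score = hybrid_unsafe_score(0.9, 0.7)
flagged = rank_ffn_neurons(harmful, benign, top_k=1)
edited = suppress_neurons(weights, flagged)
```

In this toy setup only the neuron with the largest harmful-versus-benign activation gap is edited, and all other weights are copied unchanged, which mirrors in a very simplified way the stated goal of suppressing unsafe behavior without degrading overall model quality.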
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
