SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks

Xiangman Li, Xiaodong Wu, Qi Li, Jianbing Ni, and Rongxing Lu

arXiv:2508.15182 · cs.LG · August 22, 2025

TL;DR

SafeLLM introduces an unlearning framework that effectively reduces harmful outputs in large language models caused by jailbreak prompts, while preserving their overall linguistic and functional capabilities.

Contribution

The paper presents a novel unlearning-based method with a three-stage pipeline to neutralize harmful knowledge in LLMs, improving safety without sacrificing performance.

Findings

01. Significantly lowers attack success rates across multiple benchmarks.

02. Maintains high general-purpose performance after unlearning.

03. Outperforms standard defense methods in safety and robustness.

Abstract

Jailbreak attacks pose a serious threat to the safety of Large Language Models (LLMs): adversarial prompts bypass alignment mechanisms and cause the models to produce harmful, restricted, or biased content. In this paper, we propose SafeLLM, a novel unlearning-based defense framework that unlearns harmful knowledge from LLMs while preserving linguistic fluency and general capabilities. SafeLLM employs a three-stage pipeline: (1) dynamic unsafe output detection using a hybrid approach that integrates external classifiers with model-internal evaluations; (2) token-level harmful content tracing through feedforward network (FFN) activations to localize harmful knowledge; and (3) constrained optimization to suppress unsafe behavior without degrading overall model quality. SafeLLM achieves targeted and irreversible forgetting by identifying and neutralizing FFN substructures…
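To make stages (2) and (3) more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the single-layer FFN, the mean-activation-gap scoring rule, the hard zeroing of output weights (standing in for the paper's constrained optimization), and every function name and number below are assumptions chosen only for illustration.

```python
def relu(x):
    return max(0.0, x)


def ffn_hidden(w_in, x):
    """Hidden activations of one toy FFN layer: relu(W_in @ x)."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]


def ffn_output(w_in, w_out, x):
    """Full toy FFN forward pass: W_out @ relu(W_in @ x)."""
    h = ffn_hidden(w_in, x)
    return [sum(w * hi for w, hi in zip(row, h)) for row in w_out]


def mean_activations(w_in, inputs):
    """Average hidden activation per neuron over a set of inputs."""
    sums = [0.0] * len(w_in)
    for x in inputs:
        for i, a in enumerate(ffn_hidden(w_in, x)):
            sums[i] += a
    return [s / len(inputs) for s in sums]


def trace_harmful_neurons(w_in, harmful, benign, top_k):
    """Assumed scoring rule: rank neurons by how much more they fire
    on harmful inputs than on benign ones, and return the top_k indices."""
    mh = mean_activations(w_in, harmful)
    mb = mean_activations(w_in, benign)
    gaps = [h - b for h, b in zip(mh, mb)]
    return sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)[:top_k]


def suppress_neurons(w_out, neurons):
    """Simplified stand-in for stage (3): zero the output weights
    of the selected neurons and leave all other weights untouched."""
    blocked = set(neurons)
    return [[0.0 if i in blocked else w for i, w in enumerate(row)]
            for row in w_out]


# Hypothetical toy data: 2-d inputs, 3 hidden neurons, 1 output.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_out = [[1.0, 1.0, 0.5]]
harmful_inputs = [[1.0, 0.0], [2.0, 0.0]]
benign_inputs = [[0.0, 1.0], [0.0, 2.0]]

selected = trace_harmful_neurons(w_in, harmful_inputs, benign_inputs, top_k=1)
patched_w_out = suppress_neurons(w_out, selected)

harmful_before = ffn_output(w_in, w_out, harmful_inputs[0])
harmful_after = ffn_output(w_in, patched_w_out, harmful_inputs[0])
benign_before = ffn_output(w_in, w_out, benign_inputs[0])
benign_after = ffn_output(w_in, patched_w_out, benign_inputs[0])
```

In this toy setup the selected neuron fires only on the harmful inputs, so zeroing its output weight lowers the response to those inputs and leaves the response to the benign inputs unchanged. A real model would not separate this cleanly, which is presumably why the abstract describes a constrained optimization rather than a hard edit.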

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection