DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning

Yaxuan Wang; Chris Yuhao Liu; Quan Liu; Jinglong Pang; Wei Wei; Yujia Bao; Yang Liu

arXiv:2511.05784·cs.CL·November 12, 2025

Open Access

TL;DR

DRAGON is a reasoning-based framework that enhances the unlearning of private or harmful data in LLMs without needing retain data, using in-context detection and reasoning to ensure safe model behavior.

Contribution

It introduces a novel, instruction-based unlearning method that does not require access to retain data and employs in-context reasoning for effective unlearning in practical scenarios.

Findings

01 DRAGON achieves strong unlearning performance across multiple tasks.

02 It demonstrates scalability and effectiveness without retraining the base model.

03 The framework introduces new metrics for evaluating unlearning performance.

Abstract

Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Most existing approaches rely on fine-tuning to balance unlearning efficiency with general language capabilities. However, these methods typically require training or access to retain data, which is often unavailable in real-world scenarios. Although these methods can perform well when both forget and retain data are available, few works have demonstrated equivalent capability in more practical, data-limited scenarios. To overcome these limitations, we propose Detect-Reasoning Augmented GeneratiON (DRAGON), a systematic, reasoning-based framework that utilizes in-context chain-of-thought (CoT) instructions to guard deployed LLMs before inference. Instead of modifying the base model, DRAGON leverages the inherent instruction-following ability of LLMs and introduces a…
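
The detect-then-guard idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract as shown here does not specify how detection works or what the CoT instruction contains, so the keyword-based `detect` function, the `FORGET_TERMS` list, the `COT_GUARD` text, and the `GuardedPrompt` type are all hypothetical stand-ins. The sketch only shows the general shape of guarding a prompt in context, before inference, while leaving the base model untouched.

```python
from dataclasses import dataclass

# Hypothetical forget-list; the source does not say how detection is done.
FORGET_TERMS = ("jane doe", "nerve agent synthesis")

# Hypothetical chain-of-thought guard instruction, written for illustration only.
COT_GUARD = (
    "Before answering, reason step by step about whether the request "
    "involves information that must not be disclosed. "
    "If it does, refuse politely."
)


@dataclass
class GuardedPrompt:
    flagged: bool  # True if the detector matched the prompt
    text: str      # the prompt that would be sent to the unmodified base model


def detect(prompt: str, forget_terms=FORGET_TERMS) -> bool:
    """Return True if the prompt mentions any forget-list term (case-insensitive)."""
    lowered = prompt.lower()
    return any(term in lowered for term in forget_terms)


def guard(prompt: str) -> GuardedPrompt:
    """Prepend the guard instruction to flagged prompts; pass others through unchanged."""
    if detect(prompt):
        return GuardedPrompt(True, f"{COT_GUARD}\n\nUser request: {prompt}")
    return GuardedPrompt(False, prompt)
```

In this sketch, unflagged prompts reach the model unchanged, which is one simple way to avoid degrading general capabilities; a real detector would likely need to be far more robust than substring matching.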

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Data Quality and Management