DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning
Yaxuan Wang, Chris Yuhao Liu, Quan Liu, Jinglong Pang, Wei Wei, Yujia Bao, Yang Liu

TL;DR
DRAGON is a reasoning-based framework that enhances the unlearning of private or harmful data in LLMs without needing retain data, using in-context detection and reasoning to ensure safe model behavior.
Contribution
DRAGON introduces an instruction-based unlearning method that requires no access to retain data and relies on in-context reasoning, making it usable in practical, data-limited scenarios.
Findings
DRAGON achieves strong unlearning performance across multiple tasks.
It demonstrates scalability and effectiveness without retraining the base model.
The framework introduces new metrics for evaluating unlearning performance.
Abstract
Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Most existing approaches rely on fine-tuning to balance unlearning efficiency with general language capabilities. However, these methods typically require training or access to retain data, which is often unavailable in real-world scenarios. Although these methods can perform well when both forget and retain data are available, few works have demonstrated equivalent capability in more practical, data-limited scenarios. To overcome these limitations, we propose Detect-Reasoning Augmented GeneratiON (DRAGON), a systematic, reasoning-based framework that utilizes in-context chain-of-thought (CoT) instructions to guard deployed LLMs before inference. Instead of modifying the base model, DRAGON leverages the inherent instruction-following ability of LLMs and introduces a…
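The detect-then-guard idea described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the substring-based `detect_forget_target`, the `GUARD_TEMPLATE` wording, and the `guard_prompt` wrapper are hypothetical stand-ins, not the detector or CoT instructions used by DRAGON, which the truncated abstract does not specify. The sketch only shows the general shape: check an incoming prompt before inference and, if it is flagged, wrap it in a reasoning instruction instead of modifying the base model.

```python
from dataclasses import dataclass

# Hypothetical guard instruction; the paper's actual CoT prompts are not shown here.
GUARD_TEMPLATE = (
    "The following request may involve content that must be treated as forgotten.\n"
    "Reason step by step about whether answering would reveal that content, "
    "and decline if it would.\n\n"
    "Request: {prompt}"
)


@dataclass
class GuardResult:
    flagged: bool  # True if the toy detector matched a forget term
    prompt: str    # Prompt to send to the unmodified base model


def detect_forget_target(prompt: str, forget_terms: list[str]) -> bool:
    """Toy detector: case-insensitive substring match.

    This is an assumption for illustration, not DRAGON's detection module.
    """
    lowered = prompt.lower()
    return any(term.lower() in lowered for term in forget_terms)


def guard_prompt(prompt: str, forget_terms: list[str]) -> GuardResult:
    """Wrap flagged prompts in the guard instruction; pass others through unchanged."""
    if detect_forget_target(prompt, forget_terms):
        return GuardResult(True, GUARD_TEMPLATE.format(prompt=prompt))
    return GuardResult(False, prompt)


if __name__ == "__main__":
    result = guard_prompt("What is Jane Doe's home address?", ["jane doe"])
    print(result.flagged)
    print(result.prompt)
```

Because the guard operates only on the prompt, the base model weights stay untouched and no retain data is needed, which matches the setting the abstract describes. A real system would replace the substring check with a stronger detector.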
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Data Quality and Management
