CAP: Controllable Alignment Prompting for Unlearning in LLMs
Zhaokun Wang, Jinyu Guo, Jingwen Pu, Hongli Pu, Meng Yang, Xunlei Chen, Jie Ou, Wenyi Li, Guangchun Luo, Wenhong Tian

TL;DR
The paper introduces CAP, a prompt-based framework enabling controllable, reversible unlearning of specific knowledge in large language models without modifying their parameters.
Contribution
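CAP offers an end-to-end prompt optimization approach that uses reinforcement learning for targeted knowledge unlearning in LLMs. It avoids the high computational cost, uncontrollable forgetting boundaries, and dependence on weight access that limit parameter-modifying methods.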
Findings
CAP achieves precise, controllable unlearning without parameter updates.
The framework enables reversible knowledge restoration via prompt revocation.
Experiments show CAP outperforms prior methods in unlearning accuracy and control.
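The reversibility finding can be pictured with a small sketch. Everything here is a hypothetical stand-in, not the paper's implementation: `toy_frozen_llm` is a lookup-table stub for a black-box model, the `forget:` prefix format is invented for illustration, and `PromptGate` is an assumed wrapper name. The point is only that forgetting lives in the prompt, so removing the prompt restores the original behavior while the model itself never changes.

```python
# Hypothetical toy knowledge base standing in for what a frozen model "knows".
KNOWLEDGE = {
    "capital of France": "Paris",
    "Alice's phone number": "555-0100",
}


def toy_frozen_llm(text: str) -> str:
    """Stub black-box model (assumption): it refuses a fact only when the
    text before the final newline carries a 'forget: <fact>' instruction."""
    instruction, _, question = text.rpartition("\n")
    for fact, answer in KNOWLEDGE.items():
        if fact in question:
            if f"forget: {fact}" in instruction:
                return "I don't know."
            return answer
    return "I don't know."


class PromptGate:
    """Wraps a model with an optional unlearning prefix; model state is untouched."""

    def __init__(self, model):
        self.model = model
        self.prefix = None

    def apply(self, prefix: str) -> None:
        self.prefix = prefix  # enable unlearning

    def revoke(self) -> None:
        self.prefix = None  # restore original behavior

    def ask(self, question: str) -> str:
        text = f"{self.prefix}\n{question}" if self.prefix else question
        return self.model(text)


gate = PromptGate(toy_frozen_llm)
before = gate.ask("What is Alice's phone number?")
gate.apply("forget: Alice's phone number")
during = gate.ask("What is Alice's phone number?")
retained = gate.ask("What is the capital of France?")
gate.revoke()
after = gate.ask("What is Alice's phone number?")
```

In this toy setting, `during` is a refusal, `retained` is unaffected, and `after` matches `before`, which is the behavior the finding describes at a much larger scale.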
Abstract
Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high computational costs, uncontrollable forgetting boundaries, and strict dependency on model weight access. These constraints render them impractical for closed-source models, yet current non-invasive alternatives remain unsystematic and reliant on empirical experience. To address these challenges, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end prompt-driven unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to suppress target knowledge while…
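To make the idea of prompt optimization via reinforcement learning concrete, here is a minimal sketch under heavy assumptions. It is not CAP's generator or reward: the candidate prompts, the stub `stub_llm`, and the reward (1 only if the target fact is suppressed and an unrelated fact is still answered) are all hypothetical. A softmax policy over three fixed prompts is trained with a plain REINFORCE update, and the model is only ever queried, never modified.

```python
import math
import random

# Hypothetical toy facts: "secret" is the forget target, "public" is unrelated.
FACTS = {"secret": "555-0100", "public": "Paris"}

# Hypothetical candidate prompts the policy chooses among.
CANDIDATES = ["answer normally", "refuse everything", "refuse secret"]


def stub_llm(prompt: str, topic: str) -> str:
    """Stub black-box model (assumption): refuses a topic only if the prompt names it."""
    words = prompt.split()
    if topic in words or "everything" in words:
        return "I don't know."
    return FACTS[topic]


def reward(prompt: str) -> float:
    """Assumed reward: 1.0 if the target is suppressed and the unrelated fact survives."""
    forgot = stub_llm(prompt, "secret") != FACTS["secret"]
    kept = stub_llm(prompt, "public") == FACTS["public"]
    return float(forgot and kept)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps: int = 200, lr: float = 0.5, seed: int = 0):
    """REINFORCE on a softmax policy over the candidate prompts."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)
    for _ in range(steps):
        probs = softmax(logits)
        k = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        r = reward(CANDIDATES[k])
        # Gradient of log pi(k) w.r.t. logit i is (1[i == k] - probs[i]).
        for i in range(len(logits)):
            logits[i] += lr * r * ((1.0 if i == k else 0.0) - probs[i])
    return softmax(logits)


probs = train()
best = CANDIDATES[max(range(len(probs)), key=probs.__getitem__)]
print(best, [round(p, 3) for p in probs])
```

The over-broad prompt and the no-op prompt both earn zero reward here, so the policy's probability mass can only move toward the targeted prompt. A real system would replace the fixed candidate list with a learned generator and the stub with calls to the actual LLM.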
