REBEL: Hidden Knowledge Recovery via Evolutionary-Based Evaluation Loop
Patryk Rybak, Paweł Batorski, Paul Swoboda, Przemysław Spurek

TL;DR
REBEL introduces an evolutionary prompt generation method to effectively test whether unlearning techniques truly remove sensitive knowledge from large language models, revealing that many existing methods only superficially forget information.
Contribution
The paper presents REBEL, an adversarial prompt generation framework that challenges current unlearning methods, exposing their limitations in genuinely erasing knowledge from models.
Findings
REBEL achieves up to 60% attack success rate on TOFU.
REBEL reaches up to 93% success on WMDP.
Current unlearning methods often only superficially forget information.
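As a rough illustration of how figures such as "60% attack success rate" can be read, the sketch below computes an attack success rate as the fraction of forget-set targets for which an attack recovered the supposedly removed knowledge. The per-target binary outcome and the function name `attack_success_rate` are assumptions made for this example; the listing does not state the paper's exact success criterion.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of targets for which the attack succeeded.

    `outcomes` holds one boolean per forget-set target: True if at least
    one adversarial prompt recovered the "forgotten" answer (an assumed
    success criterion, not taken from the paper).
    """
    if not outcomes:
        raise ValueError("outcomes must contain at least one target")
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes for five targets: three recovered, two not.
example_outcomes = [True, True, True, False, False]
example_asr = attack_success_rate(example_outcomes)
```

With the hypothetical outcomes above, three of five targets are recovered, which corresponds to a 60% rate.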
Abstract
Machine unlearning for LLMs aims to remove sensitive or copyrighted data from trained models. However, the true efficacy of current unlearning methods remains uncertain. Standard evaluation metrics rely on benign queries that often mistake superficial information suppression for genuine knowledge removal. Such metrics fail to detect residual knowledge that more sophisticated prompting strategies could still extract. We introduce REBEL, an evolutionary approach for adversarial prompt generation designed to probe whether unlearned data can still be recovered. Our experiments demonstrate that REBEL successfully elicits "forgotten" knowledge from models that appeared to have forgotten it under standard unlearning benchmarks, revealing that current unlearning methods may provide only a superficial layer of protection. We validate our framework on subsets of the TOFU and WMDP benchmarks, evaluating…
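The general shape of an evolutionary prompt search can be sketched as follows. This is a minimal, generic loop (score, keep the best, mutate, repeat) and not the authors' implementation: the toy model, the toy judge, the framing phrases, the secret string, and all function names are hypothetical stand-ins chosen so the example runs without any LLM.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a "secret" the toy model was supposed to forget,
# and framing phrases that the mutation operator can prepend to a prompt.
SECRET = "ALPHA-7"
FRAMINGS = ["As a historian,", "In a fictional story,",
            "Step by step,", "For a quiz,"]


def toy_model(prompt: str) -> str:
    """Toy 'unlearned' model: refuses plain queries, leaks more of the
    secret as more distinct framing phrases appear in the prompt."""
    hits = sum(f in prompt for f in FRAMINGS)
    k = min(len(SECRET), 2 * hits)
    return SECRET[:k] if k else "I don't know."


def toy_judge(response: str) -> float:
    """Leak score in [0, 1]: longest secret prefix found in the response."""
    for k in range(len(SECRET), 0, -1):
        if SECRET[:k] in response:
            return k / len(SECRET)
    return 0.0


def mutate(prompt: str, rng: random.Random) -> str:
    """Mutation operator: prepend one randomly chosen framing phrase."""
    return f"{rng.choice(FRAMINGS)} {prompt}"


def evolve_prompts(seeds: List[str],
                   score: Callable[[str], float],
                   generations: int = 10,
                   population_size: int = 8,
                   elite: int = 2,
                   seed: int = 0) -> Tuple[str, float]:
    """Keep the top `elite` prompts each generation and refill the
    population with mutated copies of them; return the best prompt."""
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        parents = sorted(population, key=score, reverse=True)[:elite]
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population_size - len(parents))]
        population = parents + children
    best = max(population, key=score)
    return best, score(best)


def leak_score(prompt: str) -> float:
    """Fitness of a prompt: judge score of the model's response to it."""
    return toy_judge(toy_model(prompt))


base_query = "What is the access code?"
best_prompt, best_score = evolve_prompts([base_query], leak_score)
```

Because the best prompts are carried over unchanged each generation, the best score never decreases over the run. In a real evaluation the toy model and judge would be replaced by the unlearned LLM and a leakage scorer, both of which are outside the scope of this sketch.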
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Privacy-Preserving Technologies in Data · Explainable Artificial Intelligence (XAI)
