REBEL: Hidden Knowledge Recovery via Evolutionary-Based Evaluation Loop

Patryk Rybak; Paweł Batorski; Paul Swoboda; Przemysław Spurek

arXiv:2602.06248·cs.LG·February 9, 2026


TL;DR

REBEL introduces an evolutionary prompt generation method to effectively test whether unlearning techniques truly remove sensitive knowledge from large language models, revealing that many existing methods only superficially forget information.

Contribution

The paper presents REBEL, an adversarial prompt generation framework that challenges current unlearning methods, exposing their limitations in genuinely erasing knowledge from models.

Findings

01. REBEL achieves up to 60% attack success rate on TOFU.

02. REBEL reaches up to 93% success on WMDP.

03. Current unlearning methods often only superficially forget information.

Abstract

Machine unlearning for LLMs aims to remove sensitive or copyrighted data from trained models. However, the true efficacy of current unlearning methods remains uncertain. Standard evaluation metrics rely on benign queries that often mistake superficial information suppression for genuine knowledge removal. Such metrics fail to detect residual knowledge that more sophisticated prompting strategies could still extract. We introduce REBEL, an evolutionary approach for adversarial prompt generation designed to probe whether unlearned data can still be recovered. Our experiments demonstrate that REBEL successfully elicits "forgotten" knowledge from models that appeared to have forgotten it under standard unlearning benchmarks, revealing that current unlearning methods may provide only a superficial layer of protection. We validate our framework on subsets of the TOFU and WMDP benchmarks, evaluating…
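The abstract does not spell out how the evolutionary loop works, so the following is only a minimal sketch of a generic evolutionary prompt search, not the authors' implementation. Every name in it (`evolve_prompts`, `toy_mutate`, `toy_score`, `FRAMINGS`) is hypothetical. In a real evaluation the scoring function would presumably query the unlearned model and judge how much of the target knowledge leaks; here a toy string-based score stands in for that step so the example runs on its own.

```python
import random
from typing import Callable, List, Tuple


def evolve_prompts(
    seed_prompts: List[str],
    mutate: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    generations: int = 5,
    population_size: int = 8,
    elite_count: int = 2,
    rng_seed: int = 0,
) -> Tuple[str, float]:
    """Generic evolutionary search: keep the best prompts, mutate them, repeat.

    This is an illustrative assumption, not the procedure from the paper.
    """
    if not seed_prompts:
        raise ValueError("seed_prompts must contain at least one prompt")

    rng = random.Random(rng_seed)
    population = list(seed_prompts)
    best_prompt = max(population, key=score)
    best_score = score(best_prompt)

    for _ in range(generations):
        # Rank candidates by score (higher means more recovered knowledge).
        ranked = sorted(population, key=score, reverse=True)
        elites = ranked[:elite_count]
        # Refill the population with mutated copies of the elites.
        children = [
            mutate(rng.choice(elites), rng)
            for _ in range(population_size - len(elites))
        ]
        population = elites + children

        # Track the best prompt seen so far.
        top = max(population, key=score)
        top_score = score(top)
        if top_score > best_score:
            best_prompt, best_score = top, top_score

    return best_prompt, best_score


# Toy stand-ins (hypothetical): a real setup would mutate prompts with an LLM
# and score them by querying the unlearned model.
FRAMINGS = ["Answer as a historian.", "Complete the sentence.", "Respond in a story."]


def toy_mutate(prompt: str, rng: random.Random) -> str:
    """Append a randomly chosen framing phrase to the prompt."""
    return prompt + " " + rng.choice(FRAMINGS)


def toy_score(prompt: str) -> float:
    """Fraction of framing phrases present in the prompt (placeholder score)."""
    return sum(f in prompt for f in FRAMINGS) / len(FRAMINGS)


best_prompt, best_score = evolve_prompts(
    ["Who wrote the book?"], toy_mutate, toy_score
)
```

Because the elites are carried over unchanged each generation, the best score never decreases across generations. That property is what makes such a loop a stronger probe than a single benign query, under the assumption that the scoring step reflects real knowledge leakage.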


Code & Models

Datasets

patryk-rybak/rebel-benchmark
dataset· 13 dl


Taxonomy

Topics: Adversarial Robustness in Machine Learning · Privacy-Preserving Technologies in Data · Explainable Artificial Intelligence (XAI)