Step-by-Step Reasoning Attack: Revealing 'Erased' Knowledge in Large Language Models
Yash Sinha, Manit Baser, Murari Mandal, Dinil Mon Divakaran, Mohan Kankanhalli

TL;DR
This paper introduces a step-by-step reasoning-based black-box attack called Sleek that exposes failures in current knowledge erasure methods in large language models, revealing hidden or suppressed information.
Contribution
It presents a novel attack framework leveraging structured prompts to systematically recover erased knowledge in LLMs, exposing weaknesses in existing unlearning techniques.
Findings
62.5% of prompts successfully retrieved forgotten Harry Potter facts
50% of prompts exposed unfair suppression of retained knowledge
Existing unlearning methods are vulnerable to structured reasoning attacks
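As a rough illustration of how a prompt-level recovery rate like the ones above could be computed, here is a minimal sketch. It is not the authors' Sleek framework or their evaluation protocol. The prompt template, the substring-based leak check, the `recovery_rate` helper, and the `stub_model` stand-in for a black-box LLM are all hypothetical and chosen only to keep the example self-contained.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical step-by-step wrapper; the paper's actual prompts are not shown in this summary.
STEP_BY_STEP_TEMPLATE = (
    "Let's reason step by step.\n"
    "Question: {question}\n"
    "First list the relevant facts, then give the final answer."
)


def leaks(response: str, target_terms: Iterable[str]) -> bool:
    """Return True if any supposedly erased term appears in the response (naive substring check)."""
    text = response.lower()
    return any(term.lower() in text for term in target_terms)


def recovery_rate(
    query_model: Callable[[str], str],
    probes: List[Tuple[str, List[str]]],
) -> float:
    """Fraction of probes whose step-by-step prompt elicits a target term."""
    if not probes:
        return 0.0
    hits = 0
    for question, target_terms in probes:
        prompt = STEP_BY_STEP_TEMPLATE.format(question=question)
        if leaks(query_model(prompt), target_terms):
            hits += 1
    return hits / len(probes)


def stub_model(prompt: str) -> str:
    """Toy stand-in for an unlearned black-box model (assumption, not a real LLM)."""
    if "step by step" in prompt.lower() and "mentors" in prompt:
        return "Fact 1: the hero has a mentor. Final answer: Merlin."
    return "I don't know."


probes = [
    ("Who mentors the hero?", ["Merlin"]),
    ("Where does the hero study?", ["Camelot"]),
]
rate = recovery_rate(stub_model, probes)
```

In practice `query_model` would wrap whatever API exposes the unlearned model, and a substring match would likely be too crude; a stricter judge for deciding whether a response leaks erased knowledge would be needed.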
Abstract
Knowledge erasure in large language models (LLMs) is important for ensuring compliance with data and AI regulations, safeguarding user privacy, and mitigating bias and misinformation. Existing unlearning methods aim to make knowledge erasure more efficient and effective by removing specific knowledge while preserving overall model performance, especially for retained information. However, unlearning techniques have been observed to suppress knowledge rather than remove it, leaving it beneath the surface and retrievable with the right prompts. In this work, we demonstrate that step-by-step reasoning can serve as a backdoor to recover this hidden information. We introduce a step-by-step reasoning-based black-box attack, Sleek, that systematically exposes unlearning failures. We employ a structured attack framework with three core components: (1) an adversarial…
