Step-by-Step Reasoning Attack: Revealing 'Erased' Knowledge in Large Language Models

Yash Sinha; Manit Baser; Murari Mandal; Dinil Mon Divakaran; Mohan Kankanhalli

arXiv:2506.17279·cs.CR·June 24, 2025

TL;DR

This paper introduces a step-by-step reasoning-based black-box attack called Sleek that exposes failures in current knowledge erasure methods in large language models, revealing hidden or suppressed information.

Contribution

It presents a novel attack framework leveraging structured prompts to systematically recover erased knowledge in LLMs, exposing weaknesses in existing unlearning techniques.

Findings

01 62.5% success in retrieving forgotten Harry Potter facts

02 50% of prompts exposed unfair suppression of retained knowledge

03 Existing unlearning methods are vulnerable to structured reasoning attacks

Abstract

Knowledge erasure in large language models (LLMs) is important for ensuring compliance with data and AI regulations, safeguarding user privacy, and mitigating bias and misinformation. Existing unlearning methods aim to make knowledge erasure more efficient and effective by removing specific knowledge while preserving overall model performance, especially on retained information. However, unlearning techniques have been observed to suppress knowledge rather than remove it, leaving it beneath the surface and retrievable with the right prompts. In this work, we demonstrate that step-by-step reasoning can serve as a backdoor to recover this hidden information. We introduce a step-by-step reasoning-based black-box attack, Sleek, that systematically exposes unlearning failures. We employ a structured attack framework with three core components: (1) an adversarial…
