PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage

Krishna Kanth Nakka; Xue Jiang; Dmitrii Usynin; Xuebing Zhou

arXiv:2507.02332 · cs.CR · August 20, 2025

TL;DR

This study shows that steering a model's internal activations can bypass LLM safeguards and elicit sensitive personal information that the model is designed to refuse to share.

Contribution

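The paper introduces an activation steering method for privacy jailbreaking: lightweight linear probes identify attention heads predictive of refusal, and steering a small subset of those heads induces the model to answer privacy-related queries it would otherwise refuse.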

Findings

01. At least 95% jailbreaking disclosure rate across four LLMs

02. Over 50% of steered responses reveal true personal information

03. Private data such as life events and relationships can be extracted

Abstract

This paper investigates privacy jailbreaking in LLMs via steering, focusing on whether manipulating activations can bypass LLM alignment and alter response behaviors to privacy-related queries (e.g., a certain public figure's sexual orientation). We begin by identifying attention heads predictive of refusal behavior for private attributes (e.g., sexual orientation) using lightweight linear probes trained with privacy evaluator labels. Next, we steer the activations of a small subset of these attention heads, guided by the trained probes, to induce the model to generate non-refusal responses. Our experiments show that these steered responses often disclose sensitive attribute details, along with other private information about data subjects, such as life events, relationships, and personal histories, that the models would typically refuse to produce. Evaluations across four LLMs reveal…
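
The probe-then-steer pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It runs on synthetic activations rather than a real model, and everything here is an assumption made for illustration: the helper names (`make_synthetic_heads`, `train_probe`, `steer`), the logistic-regression probe, the additive steering rule, and the strength `alpha`. It shows only the general idea: fit one linear probe per attention head, keep the heads whose probes best predict refusal, then shift those heads' activations along the probe direction.

```python
import numpy as np

def make_synthetic_heads(n_samples=400, n_heads=8, dim=16, informative=(2, 5), seed=0):
    """Synthetic stand-in for per-head activations; label 1 = refusal, 0 = non-refusal."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n_samples)
    acts = rng.normal(size=(n_samples, n_heads, dim))
    for h in informative:
        direction = rng.normal(size=dim)
        direction /= np.linalg.norm(direction)
        # Only these heads carry a label-dependent signal.
        acts[:, h, :] += 2.0 * (2 * y[:, None] - 1) * direction
    return acts, y

def train_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predicted_refusal_rate(X, w, b):
    """Fraction of samples the probe classifies as refusal."""
    return float(((X @ w + b) > 0).mean())

def steer(head_acts, w, alpha):
    """Shift activations by alpha along the unit-norm probe direction."""
    return head_acts + alpha * w / np.linalg.norm(w)

acts, y = make_synthetic_heads()
n_train = 300

# Step 1: one probe per head, scored by held-out accuracy.
probes, scores = [], []
for h in range(acts.shape[1]):
    w, b = train_probe(acts[:n_train, h], y[:n_train])
    probes.append((w, b))
    scores.append(float((((acts[n_train:, h] @ w + b) > 0) == y[n_train:]).mean()))

# Step 2: keep the small subset of heads whose probes predict refusal best.
top_heads = sorted(int(h) for h in np.argsort(scores)[-2:])

# Step 3: steer one selected head away from the refusal side (negative alpha).
h = top_heads[0]
w, b = probes[h]
held_out = acts[n_train:, h]
rate_before = predicted_refusal_rate(held_out, w, b)
rate_after = predicted_refusal_rate(steer(held_out, w, alpha=-6.0), w, b)
```

In a real model the activations would be read and modified inside the forward pass. In this sketch, steering is measured only by how the probe's own prediction changes, which says nothing about what a model would actually generate.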
