Discovering Universal Activation Directions for PII Leakage in Language Models
Leo Marchyok, Zachary Coalson, Sungho Keum, Sooel Son, Sanghyun Hong

TL;DR
This paper introduces UniLeak, a framework that identifies universal activation directions in language models. Adding these directions to the residual stream at inference time substantially increases the likelihood of leaking personally identifiable information (PII), revealing a latent signal behind this privacy risk.
Contribution
UniLeak provides a mechanistic-interpretability method for finding universal activation directions related to PII leakage. It needs neither training data nor ground-truth PII and relies only on text the model generates itself.
Findings
Steering along identified directions increases PII leakage substantially.
The method works across multiple models and datasets; the directions themselves are model-specific but generalize across prompts.
Steering has minimal impact on overall generation quality.
Abstract
Modern language models exhibit rich internal structure, yet little is known about how privacy-sensitive behaviors, such as personally identifiable information (PII) leakage, are represented and modulated within their hidden states. We present UniLeak, a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model's residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts. These model-specific directions generalize across contexts and amplify PII generation probability, with minimal impact on generation quality. UniLeak recovers such directions without access to training data or ground-truth PII, relying only on self-generated text. Across multiple models and datasets, steering along these universal directions substantially increases PII leakage compared to…
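The "linear addition at inference time" described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`steer`, `readout`, `softmax`), the scaling factor `alpha`, the unit-normalization of the direction, and all numeric values are assumptions chosen for illustration. A real experiment would apply the same addition to a transformer layer's hidden states rather than to a three-dimensional list.

```python
import math

def steer(hidden, direction, alpha):
    """Return hidden + alpha * (direction / ||direction||).

    Normalizing the direction is an assumption of this sketch, made so
    that alpha alone controls the steering strength.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

def readout(hidden, unembed):
    """Toy linear readout: one logit per vocabulary row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-dimensional hidden state and 2-token vocabulary.
hidden = [0.5, -0.2, 0.1]
unembed = [
    [1.0, 0.0, 0.0],  # row for an ordinary token
    [0.0, 1.0, 0.0],  # row for a stand-in "PII-like" token
]
direction = [0.0, 2.0, 0.0]  # hypothetical direction aligned with token 1

# Probability of the "PII-like" token before and after steering.
p_before = softmax(readout(hidden, unembed))[1]
p_after = softmax(readout(steer(hidden, direction, alpha=3.0), unembed))[1]
print(f"before: {p_before:.3f}, after: {p_after:.3f}")
```

In this toy setup the direction is aligned by construction with the readout row of one token, so adding it raises that token's probability. The abstract's claim is that a single such direction, found from self-generated text, has a comparable effect on PII generation across many prompts.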
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
