Discovering Universal Activation Directions for PII Leakage in Language Models

Leo Marchyok; Zachary Coalson; Sungho Keum; Sooel Son; Sanghyun Hong

arXiv:2602.16980·cs.LG·February 20, 2026

TL;DR

This paper introduces UniLeak, a framework that identifies universal activation directions in a language model's residual stream. Adding these directions at inference time significantly increases the likelihood of PII leakage, revealing a latent signal that modulates this privacy risk.

Contribution

UniLeak provides a novel mechanistic-interpretability method that finds universal activation directions related to PII leakage without needing training data or ground-truth PII.

Findings

1. Steering along the identified directions substantially increases PII leakage.
2. The method works across multiple models and datasets.
3. Steering has minimal impact on overall generation quality.

Abstract

Modern language models exhibit rich internal structure, yet little is known about how privacy-sensitive behaviors, such as personally identifiable information (PII) leakage, are represented and modulated within their hidden states. We present UniLeak, a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model's residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts. These model-specific directions generalize across contexts and amplify PII generation probability, with minimal impact on generation quality. UniLeak recovers such directions without access to training data or ground-truth PII, relying only on self-generated text. Across multiple models and datasets, steering along these universal directions substantially increases PII leakage compared to…
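
The "linear addition at inference time" described in the abstract can be illustrated with a small sketch. This is not UniLeak's implementation and does not show how the directions are discovered; it only shows the generic steering step, assuming hidden states are plain per-token vectors. The function names `steer_residual` and `projection`, the strength parameter `alpha`, and the toy values are all hypothetical and chosen for illustration.

```python
import math


def steer_residual(hidden, direction, alpha):
    """Add alpha * unit(direction) to every token's hidden state.

    hidden:    list of per-token vectors (list of lists of floats)
    direction: one vector with the same width as each hidden state
    alpha:     steering strength (hypothetical knob, not from the paper)
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # The same offset is added at every token position.
    return [[h + alpha * u for h, u in zip(token, unit)] for token in hidden]


def projection(vec, direction):
    """Scalar projection of vec onto direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(v * d for v, d in zip(vec, direction)) / norm


# Toy residual-stream activations: 2 tokens, width 3 (illustrative values).
hidden = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
direction = [0.0, 3.0, 4.0]
steered = steer_residual(hidden, direction, alpha=2.0)
```

In this sketch, each token's projection onto the direction grows by exactly `alpha`, while components orthogonal to the direction are unchanged. In a real model the addition would be applied to a chosen layer's output during the forward pass, which is an implementation detail the listing above does not specify.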

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)