Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations
Haoran Wang, Li Xiong, Kai Shu

TL;DR
This paper investigates whether large language models internally encode privacy norms as defined by contextual integrity (CI) theory. It finds a structured internal representation that can be steered to reduce privacy violations.
Contribution
The paper presents a systematic analysis of how LLMs represent privacy norms and proposes a structured steering method to improve privacy control.
Findings
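The three norm-determining CI parameters (information type, recipient, and transmission principle) are encoded as linearly separable, functionally independent directions in activation space.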
Models still leak private information despite internal norm representations.
Structured steering along CI dimensions reduces privacy violations more effectively.
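The steering finding can be illustrated with a minimal sketch of activation steering along a single direction. This is not the paper's method: the activations are synthetic, the "private" signal is planted on one axis by assumption, and the helper names (`sample`, `steer`, `mean_vector`) are hypothetical. The direction is estimated as a difference of class means, a common simple stand-in for a trained probe.

```python
import random

DIM = 16
rng = random.Random(0)  # fixed seed so the sketch is reproducible


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def unit(v):
    norm = dot(v, v) ** 0.5
    return [a / norm for a in v]


def sample(private):
    """Synthetic 'hidden state': Gaussian noise plus a planted signal on axis 0.

    Assumption for illustration only; real activations would come from a model.
    """
    h = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    h[0] += 3.0 if private else -3.0
    return h


def steer(h, direction, alpha):
    """Shift hidden state h by alpha along a unit-norm direction."""
    return [a + alpha * d for a, d in zip(h, direction)]


private = [sample(True) for _ in range(200)]
public = [sample(False) for _ in range(200)]

# Difference-of-means estimate of the "private" direction.
direction = unit(
    [p - q for p, q in zip(mean_vector(private), mean_vector(public))]
)

h = public[0]
steered = steer(h, direction, alpha=4.0)

# The projection onto the direction moves by alpha; other components
# change only by alpha times their (small) share of the direction.
print("projection before:", dot(h, direction))
print("projection after: ", dot(steered, direction))
```

Because the direction has unit norm, steering by `alpha` shifts the projection onto that direction by exactly `alpha`. A structured variant in the spirit of the paper would presumably keep one such direction per CI dimension, but how the authors combine them is not stated in this summary.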
Abstract
Large language models (LLMs) are increasingly deployed in high-stakes settings, yet they frequently violate contextual privacy by disclosing private information in situations where humans would exercise discretion. This raises a fundamental question: do LLMs internally encode contextual privacy norms, and if so, why do violations persist? We present the first systematic study of contextual privacy as a structured latent representation in LLMs, grounded in contextual integrity (CI) theory. Probing multiple models, we find that the three norm-determining CI parameters (information type, recipient, and transmission principle) are encoded as linearly separable and functionally independent directions in activation space. Despite this internal structure, models still leak private information in practice, revealing a clear gap between concept representation and model behavior. To bridge this…
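To make the probing claim concrete, here is a toy sketch of what "linearly separable directions" for the three CI parameters could look like. Everything in it is an assumption for illustration: the activations are synthetic, each parameter is planted on its own axis, labels are binary, and the probe is a difference-of-means classifier rather than whatever probe the authors trained. Near-zero cosine similarity between probe directions is only a geometric proxy and may not match the paper's notion of functional independence.

```python
import itertools
import random

CI_PARAMS = ["information_type", "recipient", "transmission_principle"]
DIM = 32
rng = random.Random(0)  # fixed seed so the sketch is reproducible


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sample(labels):
    """Synthetic activation: noise plus one planted axis per CI parameter."""
    h = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    for axis, label in enumerate(labels):
        h[axis] += 3.0 if label else -3.0
    return h


# Balanced dataset: 75 samples for each of the 8 binary label combinations.
data = [
    (sample(labels), labels)
    for labels in itertools.product([0, 1], repeat=len(CI_PARAMS))
    for _ in range(75)
]


def fit_probe(index):
    """Difference-of-means probe for one parameter: (unit direction, threshold)."""
    pos = mean_vector([h for h, l in data if l[index] == 1])
    neg = mean_vector([h for h, l in data if l[index] == 0])
    diff = [p - n for p, n in zip(pos, neg)]
    norm = dot(diff, diff) ** 0.5
    direction = [d / norm for d in diff]
    threshold = (dot(pos, direction) + dot(neg, direction)) / 2
    return direction, threshold


def accuracy(index, direction, threshold):
    """Fraction of samples classified correctly (measured on the fitting data)."""
    hits = sum((dot(h, direction) > threshold) == bool(l[index]) for h, l in data)
    return hits / len(data)


probes = [fit_probe(i) for i in range(len(CI_PARAMS))]
accuracies = {name: accuracy(i, *probes[i]) for i, name in enumerate(CI_PARAMS)}
cosines = {
    (CI_PARAMS[i], CI_PARAMS[j]): dot(probes[i][0], probes[j][0])
    for i, j in itertools.combinations(range(len(CI_PARAMS)), 2)
}

print(accuracies)  # high accuracy indicates linear separability
print(cosines)     # values near zero indicate near-orthogonal directions
```

On real models the same two checks would be run on activations extracted from a chosen layer, with accuracy measured on held-out prompts rather than on the data used to fit the probe.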
