Beyond Jailbreaking: Auditing Contextual Privacy in LLM Agents

Saswat Das; Jameson Sandler; Ferdinando Fioretto

arXiv:2506.10171·cs.CR·September 30, 2025

Beyond Jailbreaking: Auditing Contextual Privacy in LLM Agents

Saswat Das, Jameson Sandler, Ferdinando Fioretto

PDF

Open Access 3 Reviews

TL;DR

This paper introduces CMPL, a comprehensive auditing framework that stress-tests LLM agents for latent privacy vulnerabilities through multi-turn interactions, revealing risks beyond single-turn defenses.

Contribution

It presents the novel CMPL framework for systematic privacy risk assessment in LLM agents, including an open benchmark and quantifiable metrics for multi-turn vulnerability detection.

Findings

01

CMPL uncovers privacy risks undetected by single-turn defenses.

02

The framework reveals temporal dynamics and adaptive strategies of adversaries.

03

Evaluation across diverse domains demonstrates its broad applicability.

Abstract

LLM agents have begun to appear as personal assistants, customer service bots, and clinical aides. While these applications deliver substantial operational benefits, they also require continuous access to sensitive data, which increases the likelihood of unauthorized disclosures. Moreover, these disclosures go beyond mere explicit disclosure, leaving open avenues for gradual manipulation or sidechannel information leakage. This study proposes an auditing framework for conversational privacy that quantifies an agent's susceptibility to these risks. The proposed Conversational Manipulation for Privacy Leakage (CMPL) framework is designed to stress-test agents that enforce strict privacy directives against an iterative probing strategy. Rather than focusing solely on a single disclosure event or purely explicit leakage, CMPL simulates realistic multi-turn interactions to systematically…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01Rating 4Confidence 3

Strengths

* Adversarial attacks against LLM agents to induce unintended disclosure are practically important, and have started to recently gain attention. The paper is timely. * The paper leverages notions from contextual integrity, which provides an interesting platform for analyzing privacy in conversational settings.

Weaknesses

* One of my main concerns is that the paper does not put its contributions into the context with respect to recent works on contextual privacy. [Bagdasaryan et al., 2024] is presented in the paper as a jailbreaking attack, whereas [Bagdasaryan et al., 2024] used notions from CI to propose adversarial context hijacking attacks and proposed a defense to prevent unintended leakage by LLM agents. Notions from [Bagdasaryan et al., 2024] such as privacy directives and information profile are used in t

Reviewer 02Rating 8Confidence 4

Strengths

- This paper makes a substantial contribution by expanding the investigation of privacy threats in LLM agents from a single disclosure event to a dynamic, multi-turn interaction scenario. This captures a more realistic problem setup and also demonstrates a higher attack success rate, showing that the privacy vulnerabilities in these agents are even more severe than what has been revealed in prior work. - The paper is clearly and effectively written. The evaluation is thorough and methodologicall

Weaknesses

- Although the audit framework is useful for understanding the capabilities of the attackers, I feel the paper has a limited study and discussion of mitigations against this type of adaptive attack. Specifically, the application agent does not explore alternative designs or varied ways to safeguard information. This raises questions about whether the high attack success rate reflects the current frontier of agent privacy capabilities, or whether it is more an artifact that could be addressed wit

Reviewer 03Rating 6Confidence 4

Strengths

- Threat model with multi-turn adversaries is novel and has not been studied before - This setup requires sophisticated auditor and tracking the leakage across turns. - Paper identifies different strategies by the adversaries. - Formal definitions for leakages

Weaknesses

- Auditor uses LLM for determining leakage opens up to missing leakage or masked leakage - A victim LLM might be sharing much more data than Auditor can catch, for example answers like Yes/No might be hard to spot when the question to the victim is obfuscated for the auditor - Scenarios could be extended to different settings, i.e. trading, appointments, etc - I would like to have a more substantial discussion on defenses

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsCybercrime and Law Enforcement Studies

Methodstravel james