Automating Prompt Leakage Attacks on Large Language Models Using Agentic   Approach

Tvrtko Sternak; Davor Runje; Dorian Grano\v{s}a; Chi Wang

arXiv:2502.12630·cs.CR·February 19, 2025

Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach

Tvrtko Sternak, Davor Runje, Dorian Grano\v{s}a, Chi Wang

PDF

Open Access 1 Repo

TL;DR

This paper introduces a systematic, agent-based framework inspired by cryptography to evaluate and improve the security of large language models against prompt leakage attacks, enabling rigorous testing of their robustness.

Contribution

It presents a novel multi-agent testing framework for prompt leakage, bridging automated threat modeling with practical LLM security evaluation.

Findings

01

Developed an agentic approach to probe LLMs for prompt leakage.

02

Defined a cryptographically inspired standard for prompt leakage safety.

03

Provided an open-source implementation for practical testing.

Abstract

This paper presents a novel approach to evaluating the security of large language models (LLMs) against prompt leakage-the exposure of system-level prompts or proprietary configurations. We define prompt leakage as a critical threat to secure LLM deployment and introduce a framework for testing the robustness of LLMs using agentic teams. Leveraging AG2 (formerly AutoGen), we implement a multi-agent system where cooperative agents are tasked with probing and exploiting the target LLM to elicit its prompt. Guided by traditional definitions of security in cryptography, we further define a prompt leakage-safe system as one in which an attacker cannot distinguish between two agents: one initialized with an original prompt and the other with a prompt stripped of all sensitive information. In a safe system, the agents' outputs will be indistinguishable to the attacker, ensuring that…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

sternakt/prompt-leakage-probing
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling