Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach
Tvrtko Sternak, Davor Runje, Dorian Grano\v{s}a, Chi Wang

TL;DR
This paper introduces a systematic, agent-based framework inspired by cryptography to evaluate and improve the security of large language models against prompt leakage attacks, enabling rigorous testing of their robustness.
Contribution
It presents a novel multi-agent testing framework for prompt leakage, bridging automated threat modeling with practical LLM security evaluation.
Findings
Developed an agentic approach to probe LLMs for prompt leakage.
Defined a cryptographically inspired standard for prompt leakage safety.
Provided an open-source implementation for practical testing.
Abstract
This paper presents a novel approach to evaluating the security of large language models (LLMs) against prompt leakage-the exposure of system-level prompts or proprietary configurations. We define prompt leakage as a critical threat to secure LLM deployment and introduce a framework for testing the robustness of LLMs using agentic teams. Leveraging AG2 (formerly AutoGen), we implement a multi-agent system where cooperative agents are tasked with probing and exploiting the target LLM to elicit its prompt. Guided by traditional definitions of security in cryptography, we further define a prompt leakage-safe system as one in which an attacker cannot distinguish between two agents: one initialized with an original prompt and the other with a prompt stripped of all sensitive information. In a safe system, the agents' outputs will be indistinguishable to the attacker, ensuring that…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling
