TL;DR
This paper presents AP-Test, a method that uses guard-specific adversarial prompts to identify which guardrails are deployed in black-box, LLM-based AI agents.
Contribution
The paper introduces AP-Test, which combines guard-specific adversarial prompts with two complementary testing strategies (input and output guard tests) and a new metric, match score, to identify the guardrails deployed in black-box AI agents.
Findings
The paper reports that AP-Test achieves perfect classification accuracy across diverse scenarios.
The method effectively distinguishes different guardrails in black-box models.
Ablation studies confirm the importance of each component in AP-Test.
Abstract
With the rapid adoption of large language models (LLMs), conversational AI agents have become widely deployed across real-world applications. To enhance safety, these agents are often equipped with guardrails that moderate harmful content. Identifying the guardrails in an agent thus becomes critical for adversaries to understand the system and design guard-specific attacks. In this work, we introduce AP-Test, a novel approach that leverages guard-specific adversarial prompts to detect the identity of guardrails deployed in black-box AI agents. Our method addresses key challenges in this task, including the influence of safety-aligned LLMs and other guardrails, as well as a lack of principled decision-making strategies. AP-Test employs two complementary testing strategies, input and output guard tests, and a new metric, match score, to enable robust identification. Experiments across…