Effective Red-Teaming of Policy-Adherent Agents
Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, Ateret Anaby-Tavor

TL;DR
This paper introduces CRAFT, a multi-agent red-teaming system that tests how robust policy-adherent LLM-based agents are to adversarial manipulation. The results reveal vulnerabilities and a need for stronger defenses.
Contribution
The paper presents a novel threat model, a red-teaming system (CRAFT), and a benchmark (tau-break) for evaluating and improving the resilience of policy-adherent agents against malicious users.
Findings
CRAFT outperforms traditional jailbreak methods in undermining policy adherence.
Existing defenses are insufficient against sophisticated adversarial strategies.
The tau-break benchmark effectively measures agent robustness.
Abstract
Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and…
Taxonomy
Topics: Access Control and Trust · Security and Verification in Computing · Advanced Malware Detection Techniques
