TL;DR
SIRAJ is a comprehensive red-teaming framework for LLM agents that generates diverse risk scenarios, refines adversarial attacks iteratively, and employs model distillation to create efficient, high-performing smaller red-teaming models.
Contribution
The paper introduces a dynamic two-step red-teaming process (seed test case generation followed by iterative attack refinement) and a distillation approach that uses structured forms of a teacher model's reasoning to train smaller, cheaper red-teaming models.
Findings
Seed test case generation increases risk coverage by 2-2.5x.
Distilled 8B red-teamer improves attack success rate by 100%.
The framework generalizes across diverse agent evaluation settings.
Abstract
The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts with an agent definition and generates diverse seed test cases that cover various risk outcomes, tool-use trajectories, and risk sources. Then, it iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of former attempts. To optimize the red-teaming cost, we present a model distillation approach that leverages structured forms of a teacher model's reasoning to train smaller models that are equally effective. Across diverse evaluation agent settings, our seed test case generation approach yields 2-2.5x…
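The two-step process described in the abstract (seed test cases first, then attacks refined from the trajectories of earlier attempts) can be pictured with a minimal Python sketch. This is not the paper's implementation. Every name here (`red_team`, `Attempt`, `toy_agent`, `toy_judge`, `toy_refine`) is hypothetical, and the toy agent, judge, and refiner are stand-ins for the LLM-based components a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    """One attack attempt: the prompt sent, what the agent did, and the verdict."""
    prompt: str
    trajectory: List[str]
    success: bool


def red_team(
    seed_cases: List[str],
    run_agent: Callable[[str], List[str]],
    judge: Callable[[List[str]], bool],
    refine: Callable[[str, List[Attempt]], str],
    max_iters: int = 3,
) -> List[List[Attempt]]:
    """For each seed test case, attack the black-box agent and refine the
    prompt from the history of earlier attempts until the judge reports a
    risky outcome or the iteration budget runs out."""
    results = []
    for seed in seed_cases:
        prompt = seed
        history: List[Attempt] = []
        for _ in range(max_iters):
            trajectory = run_agent(prompt)   # black-box execution
            success = judge(trajectory)      # did a risky action occur?
            history.append(Attempt(prompt, trajectory, success))
            if success:
                break
            prompt = refine(seed, history)   # use past trajectories
        results.append(history)
    return results


# Toy stand-ins (assumptions for illustration only).
def toy_agent(prompt: str) -> List[str]:
    # Refuses unless the request is framed as urgent.
    if "urgent" in prompt.lower():
        return ["call:delete_files"]
    return ["refuse"]


def toy_judge(trajectory: List[str]) -> bool:
    # Treats any tool call as the risky outcome.
    return any(step.startswith("call:") for step in trajectory)


def toy_refine(seed: str, history: List[Attempt]) -> str:
    # A real refiner would reason over the trajectories in `history`.
    return f"{seed} (attempt {len(history) + 1}, framed as urgent)"


if __name__ == "__main__":
    for history in red_team(["Delete the project files"], toy_agent, toy_judge, toy_refine):
        for attempt in history:
            print(attempt.success, attempt.prompt)
```

In this toy run the first attempt is refused and the refined second attempt triggers the tool call. A real refiner would condition on the full trajectory rather than apply a fixed rewrite.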