TL;DR
AgenticRed is an automated, evolutionary pipeline in which LLMs design and refine red-teaming systems without human intervention, raising attack success rates across multiple target models.
Contribution
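AgenticRed treats red-teaming as a system design problem: instead of optimizing attacker policies within a predefined workflow, it evolves entire red-teaming systems without human intervention, and the resulting systems outperform existing methods at exposing model vulnerabilities.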
Findings
Achieves 96% attack success rate (ASR) on Llama-2-7B, 98% on Llama-3-8B, and 100% on Qwen3-8B.
Outperforms state-of-the-art approaches in red-teaming effectiveness.
Generates transfer-robust red-teaming systems applicable to proprietary models.
Abstract
While recent automated red-teaming methods show promise for systematically exposing model vulnerabilities, most existing approaches rely on human-specified workflows. This dependence on manually designed workflows suffers from human biases and makes exploring the broader design space expensive. We introduce AgenticRed, an automated pipeline that leverages LLMs' in-context learning to iteratively design and refine red-teaming systems without human intervention. Rather than optimizing attacker policies within predefined structures, AgenticRed treats red-teaming as a system design problem, and it autonomously evolves automated red-teaming systems using evolutionary selection and generational knowledge. Red-teaming systems designed by AgenticRed consistently outperform state-of-the-art approaches, achieving 96% attack success rate (ASR) on Llama-2-7B, 98% on Llama-3-8B and 100% on Qwen3-8B…
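The loop the abstract describes (propose candidate systems, score them by ASR, keep the best, and pass the survivors on as "generational knowledge") can be illustrated with a minimal sketch. This is not the authors' implementation. Every name here (`Candidate`, `evolve`, `toy_propose`, `toy_evaluate`) is hypothetical, and the proposer and evaluator are deterministic stubs standing in for an LLM designer and a target-model-plus-judge evaluation, neither of which is specified in the text above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    design: str          # placeholder for a system description (hypothetical)
    generation: int
    fitness: float = 0.0


def attack_success_rate(outcomes: List[bool]) -> float:
    """ASR = fraction of evaluation attempts judged successful."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def evolve(
    propose: Callable[[List[Candidate]], str],
    evaluate: Callable[[str], List[bool]],
    generations: int = 3,
    population: int = 4,
    survivors: int = 2,
) -> List[Candidate]:
    """Generic evolutionary selection with an elitist archive."""
    archive: List[Candidate] = []
    for gen in range(generations):
        # New designs are conditioned on the archive of earlier survivors.
        offspring = [Candidate(propose(archive), gen) for _ in range(population)]
        for cand in offspring:
            cand.fitness = attack_success_rate(evaluate(cand.design))
        # Selection: keep only the top-scoring designs seen so far.
        ranked = sorted(archive + offspring, key=lambda c: c.fitness, reverse=True)
        archive = ranked[:survivors]
    return archive


# Toy stand-ins (assumptions): a "design" is just an integer level, and a
# design at level n succeeds on n out of 10 fixed trials.
def toy_propose(archive: List[Candidate]) -> str:
    if not archive:
        return "1"
    return str(int(archive[0].design) + 1)  # refine the current best design


def toy_evaluate(design: str) -> List[bool]:
    return [i < int(design) for i in range(10)]


if __name__ == "__main__":
    best = evolve(toy_propose, toy_evaluate)
    print(best[0].design, best[0].fitness)
```

Because the archive is carried across generations and re-ranked together with each new batch, the best fitness never decreases from one generation to the next. That elitist selection is a design choice of this sketch, not something the abstract states about AgenticRed.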
