Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
Tej Deep Pala, Vernon Y.H. Toh, Rishabh Bhardwaj, Soujanya Poria

TL;DR
Ferret is a novel automated red-teaming method that significantly improves the efficiency and success rate of adversarial prompt generation for large language models by using reward-based scoring and multiple mutations per iteration.
Contribution
Ferret introduces a reward-based scoring technique and multiple adversarial mutations per iteration, enhancing the speed and effectiveness of automated red teaming for LLMs.
Findings
Achieves a 95% overall attack success rate (ASR), 46% higher than Rainbow Teaming.
Reduces the time needed to reach a 90% ASR by 15.2%.
Generates transferable adversarial prompts effective on larger LLMs.
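For readers unfamiliar with the metric, attack success rate is conventionally the fraction of adversarial prompts that are judged to have elicited an unsafe response. The sketch below shows only that arithmetic. The function name and the boolean-list input are illustrative assumptions, and how the paper judges individual attacks is not described here.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attacks judged successful.

    `outcomes` is a hypothetical list of booleans, one per adversarial
    prompt, where True means the attack was judged successful.
    """
    if not outcomes:
        raise ValueError("outcomes must contain at least one attack result")
    return sum(outcomes) / len(outcomes)


# Illustrative data only: 19 successes out of 20 attempts.
example = [True] * 19 + [False]
asr = attack_success_rate(example)
```

With 19 successes out of 20 attempts, the function returns 0.95, which corresponds to a 95% ASR.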
Abstract
In today's era, where large language models (LLMs) are integrated into numerous real-world applications, ensuring their safety and robustness is crucial for responsible AI usage. Automated red-teaming methods play a key role in this process by generating adversarial attacks to identify and mitigate potential vulnerabilities in these models. However, existing methods often struggle with slow performance, limited categorical diversity, and high resource demands. While Rainbow Teaming, a recent approach, addresses the diversity challenge by framing adversarial prompt generation as a quality-diversity search, it remains slow and requires a large fine-tuned mutator for optimal performance. To overcome these limitations, we propose Ferret, a novel approach that builds upon Rainbow Teaming by generating multiple adversarial prompt mutations per iteration and using a scoring function to rank…
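The core idea described above, generating several mutations per iteration and using a scoring function to rank them, can be sketched as a toy quality-diversity loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `mutate` and `score` are hypothetical stand-ins for an LLM mutator and a reward-based scorer, the archive keyed by a category label is an assumed data structure, and taking the parent from the same archive cell is a simplification.

```python
import random


def mutate(prompt, rng):
    # Hypothetical stand-in for an LLM mutator: appends a random token.
    return prompt + " " + rng.choice(["alpha", "beta", "gamma", "delta"])


def score(prompt):
    # Hypothetical stand-in for a reward-based scorer: longer text scores higher.
    return float(len(prompt))


def search_step(archive, cell, seed_prompt, n_mutations, rng):
    """Generate several mutations for one cell and keep the top-ranked one."""
    # Simplification: the parent is the cell's current entry, or the seed.
    parent = archive[cell][0] if cell in archive else seed_prompt
    candidates = [mutate(parent, rng) for _ in range(n_mutations)]
    # Rank all candidates with the scoring function and take the best.
    best = max(candidates, key=score)
    best_score = score(best)
    # Replace the cell's entry only if the new candidate scores higher.
    if cell not in archive or best_score > archive[cell][1]:
        archive[cell] = (best, best_score)
    return archive


def run(cells, seed_prompt, iterations, n_mutations, seed=0):
    """Run the toy loop; returns a dict mapping cell -> (prompt, score)."""
    rng = random.Random(seed)
    archive = {}
    for _ in range(iterations):
        cell = rng.choice(cells)  # assumed: uniform choice over categories
        search_step(archive, cell, seed_prompt, n_mutations, rng)
    return archive


archive = run(["category_a", "category_b"], "seed prompt", 20, 4)
```

Setting `n_mutations` to 1 reduces the loop to one candidate per iteration, which makes it easy to compare against the multi-mutation variant on the same toy scorer.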
Taxonomy
Topics: Behavioral and Psychological Studies · Animal Behavior and Welfare Studies · Insect and Arachnid Ecology and Behavior
Methods: LLaMA
