FERRET: Framework for Expansion Reliant Red Teaming
Ninareh Mehrabi, Vitor Albiero, Maya Pavlova, Joanna Bitton

TL;DR
FERRET is a comprehensive automated red teaming framework that enhances adversarial multi-modal conversations through horizontal, vertical, and meta expansions, outperforming existing methods.
Contribution
The paper introduces FERRET, a novel multi-faceted framework that systematically improves adversarial conversation generation via multiple expansion strategies.
Findings
FERRET generates more effective adversarial conversations.
FERRET outperforms existing red teaming approaches.
Multi-modal attack strategies are effectively discovered during conversations.
Abstract
We introduce a multi-faceted automated red teaming framework whose goal is to generate multi-modal adversarial conversations that break a target model, together with several expansions that make these conversations more effective and efficient. The introduced expansions are: 1. Horizontal expansion, in which the red team model self-improves and generates more effective conversation starters that shape a conversation. 2. Vertical expansion, in which the conversation starters discovered in the horizontal expansion phase are expanded into effective multi-modal conversations. 3. Meta expansion, in which the red team model discovers more effective multi-modal attack strategies during the course of a conversation. We call our framework FERRET (Framework for Expansion Reliant Red…
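The three expansions can be pictured as a small control loop. The sketch below is only an illustration of how the phases described in the abstract might fit together, not the authors' implementation: every name in it (horizontal_expansion, vertical_expansion, the toy_* stubs, the length-based score) is a hypothetical placeholder, and the meta step is simplified to choosing among a fixed list of strategy labels each turn.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Conversation:
    turns: List[str] = field(default_factory=list)
    score: float = 0.0


def horizontal_expansion(
    propose: Callable[[List[str]], List[str]],
    score: Callable[[str], float],
    rounds: int,
    keep: int,
) -> List[str]:
    """Repeatedly propose starters, feeding the best ones back as exemplars."""
    exemplars: List[str] = []
    for _ in range(rounds):
        candidates = propose(exemplars)
        pool = set(exemplars) | set(candidates)
        # Highest score first; ties broken alphabetically for determinism.
        exemplars = sorted(pool, key=lambda s: (-score(s), s))[:keep]
    return exemplars


def vertical_expansion(
    starter: str,
    next_turn: Callable[[List[str], str], str],
    respond: Callable[[List[str]], str],
    score: Callable[[str], float],
    strategies: List[str],
    max_turns: int,
) -> Conversation:
    """Grow one starter into a multi-turn conversation."""
    convo = Conversation(turns=[starter])
    for _ in range(max_turns):
        convo.turns.append(respond(convo.turns))
        # Simplified stand-in for meta expansion (an assumption of this
        # sketch): draft one follow-up per strategy, keep the best-scoring one.
        drafts = [next_turn(convo.turns, s) for s in strategies]
        convo.turns.append(max(drafts, key=score))
    convo.score = score(convo.turns[-1])
    return convo


# Toy stand-ins for the red team model, target model, and scorer.
# They are deliberately trivial so the control flow can be run end to end.
def toy_propose(exemplars: List[str]) -> List[str]:
    base = exemplars[0] if exemplars else "s"
    return [base + "a", base + "ab"]


def toy_respond(turns: List[str]) -> str:
    return f"reply{len(turns)}"


def toy_next_turn(turns: List[str], strategy: str) -> str:
    return f"{strategy}:{len(turns)}"


def toy_score(text: str) -> float:
    return float(len(text))
```

In a real setup the stubs would be replaced by calls to a red team model, a target model, and a safety or attack-success scorer; the horizontal phase would run first, and its surviving starters would each be passed to the vertical phase.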
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Multimodal Machine Learning Applications
