FERRET: Framework for Expansion Reliant Red Teaming

Ninareh Mehrabi; Vitor Albiero; Maya Pavlova; Joanna Bitton

arXiv:2603.10010·cs.CL·March 12, 2026

Open Access

TL;DR

FERRET is an automated red teaming framework that generates adversarial multi-modal conversations and strengthens them through horizontal, vertical, and meta expansions; the authors report that it outperforms existing red teaming methods.

Contribution

The paper introduces FERRET, a multi-faceted framework that improves adversarial conversation generation through three expansion strategies: horizontal, vertical, and meta.

Findings

1. FERRET generates more effective adversarial conversations.

2. FERRET outperforms existing red teaming approaches.

3. Multi-modal attack strategies are effectively discovered during conversations.

Abstract

We introduce a multi-faceted automated red teaming framework whose goal is to generate multi-modal adversarial conversations that break a target model, together with several expansions that make these conversations more effective and efficient. The expansions are:

1. Horizontal expansion, in which the red team model self-improves and generates more effective conversation starters that shape a conversation.

2. Vertical expansion, in which the conversation starters discovered in the horizontal expansion phase are expanded into effective multi-modal conversations.

3. Meta expansion, in which the red team model discovers more effective multi-modal attack strategies during the course of a conversation.

We call our framework FERRET (Framework for Expansion Reliant Red…
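The three expansions in the abstract can be read as a simple control flow. The following is a minimal sketch of that reading only, not the authors' implementation: the abstract does not specify any algorithm, so every name here (`horizontal_expansion`, `vertical_expansion`, `propose`, `score`, `next_turn`, `judge`) is hypothetical, the models are stand-in callables, and the greedy strategy-selection rule used for meta expansion is an assumption made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def horizontal_expansion(
    seeds: List[str],
    propose: Callable[[str], str],   # hypothetical: red team model rewrites a starter
    score: Callable[[str], float],   # hypothetical: estimated effectiveness of a starter
    rounds: int = 3,
    keep: int = 2,
) -> List[str]:
    """Assumed self-improvement loop: propose variants, keep the top-scoring starters."""
    pool = list(seeds)
    for _ in range(rounds):
        # Old starters plus one proposed variant each, de-duplicated in order.
        candidates = list(dict.fromkeys(pool + [propose(s) for s in pool]))
        pool = sorted(candidates, key=score, reverse=True)[:keep]
    return pool


def vertical_expansion(
    starter: str,
    strategies: List[str],
    next_turn: Callable[[List[str], str], str],  # hypothetical: next adversarial turn
    judge: Callable[[List[str]], float],         # hypothetical: score in [0, 1]
    max_turns: int = 4,
) -> Tuple[List[str], Dict[str, float]]:
    """Grow a starter into a conversation, choosing a strategy at every turn.

    Meta expansion is approximated (an assumption) by tracking the latest
    judge score per strategy and greedily picking the best one so far.
    Optimistic initial values make every strategy get tried at least once.
    """
    value = {s: 1.0 for s in strategies}
    conversation = [starter]
    for _ in range(max_turns - 1):
        strategy = max(strategies, key=lambda s: value[s])
        conversation.append(next_turn(conversation, strategy))
        value[strategy] = judge(conversation)
    return conversation, value
```

In this sketch, the starters returned by `horizontal_expansion` would each be passed to `vertical_expansion`; a real system would replace the callables with calls to the red team model, the target model, and a safety judge.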

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Multimodal Machine Learning Applications