Against The Achilles' Heel: A Survey on Red Teaming for Generative Models

Lizhi Lin; Honglin Mu; Zenan Zhai; Minghan Wang; Yuxia Wang; Renxi Wang; Junjie Gao; Yixuan Zhang; Wanxiang Che; Timothy Baldwin; Xudong Han; Haonan Li

arXiv:2404.00629·cs.CL·November 27, 2024·2 cites

TL;DR

This survey reviews red teaming strategies for generative models, introduces a taxonomy of attack strategies and a unifying "searcher" framework, and covers emerging topics such as multimodal attacks and defenses, risks around LLM-based agents, and the trade-off between harmlessness and helpfulness.

Contribution

It provides a fine-grained taxonomy of attack strategies, a "searcher" framework that unifies automatic red teaming approaches, and coverage of emerging research areas in generative model safety.

Findings

01

Developed a taxonomy of attack strategies based on language model capabilities.

02

Created the 'searcher' framework to unify automatic red teaming methods.

03

Explored emerging topics such as multimodal attacks and defenses, and the balance between harmlessness and helpfulness.

Abstract

Generative models are rapidly gaining popularity and being integrated into everyday applications, raising concerns over their safe use as various vulnerabilities are exposed. In light of this, the field of red teaming is undergoing fast-paced growth, highlighting the need for a comprehensive survey covering the entire pipeline and addressing emerging topics. Our extensive survey, which examines over 120 papers, introduces a taxonomy of fine-grained attack strategies grounded in the inherent capabilities of language models. Additionally, we have developed the "searcher" framework to unify various automatic red teaming approaches. Moreover, our survey covers novel areas including multimodal attacks and defenses, risks around LLM-based agents, overkill of harmless queries, and the balance between harmlessness and helpfulness.

Code & Models

Repositories

Libr-AI/OpenRedTeaming
Official

Datasets

BAAI/SurveyScope
dataset · 6 downloads

Taxonomy

Topics: Semantic Web and Ontologies