Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming
Nanna Inie, Jonathan Stray, Leon Derczynski

TL;DR
This paper develops a grounded theory of LLM red teaming, exploring why and how practitioners intentionally generate abnormal outputs from large language models through diverse attack strategies.
Contribution
It provides the first comprehensive qualitative analysis defining LLM red teaming, its motivations, and a taxonomy of attack strategies and techniques.
Findings
LLM red teaming is a limit-seeking, non-malicious, team-based activity.
Practitioners are motivated by curiosity, fun, and harm concerns.
A taxonomy of 12 strategies and 35 techniques for attacking LLMs is presented.
Abstract
Engaging in the deliberate generation of abnormal outputs from Large Language Models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks, defining LLM red teaming based on extensive and diverse evidence. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause LLMs to fail. We focused on the research questions of defining LLM red teaming, uncovering the motivations and goals for performing the activity, and characterizing the strategies people use when attacking LLMs. Based on the data, LLM red teaming is defined as a limit-seeking, non-malicious, manual activity, which depends highly on team effort and an alchemist mindset. It is highly intrinsically motivated by curiosity, fun, and to…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
