Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming

Nanna Inie; Jonathan Stray; Leon Derczynski

arXiv:2311.06237·cs.CL·December 12, 2024·2 cites

Open Access

TL;DR

This paper develops a grounded theory of LLM red teaming, exploring why and how practitioners intentionally generate abnormal outputs from large language models through diverse attack strategies.

Contribution

It provides the first comprehensive qualitative analysis defining LLM red teaming, its motivations, and a taxonomy of attack strategies and techniques.

Findings

01. LLM red teaming is a limit-seeking, non-malicious, team-based activity.

02. Practitioners are motivated by curiosity, fun, and harm concerns.

03. A taxonomy of 12 strategies and 35 techniques of attacking LLMs is presented.

Abstract

Engaging in the deliberate generation of abnormal outputs from Large Language Models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks, defining LLM red teaming based on extensive and diverse evidence. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause LLMs to fail. We focused on the research questions of defining LLM red teaming, uncovering the motivations and goals for performing the activity, and characterizing the strategies people use when attacking LLMs. Based on the data, LLM red teaming is defined as a limit-seeking, non-malicious, manual activity, which depends highly on a team effort and an alchemist mindset. It is highly intrinsically motivated by curiosity, fun, and to…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques