Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell

TL;DR
This paper introduces a large dataset of human-generated prompt injection attacks and defenses from an online game, providing insights into LLM vulnerabilities and establishing a benchmark for evaluating resistance to such attacks.
Contribution
It presents the largest dataset of adversarial prompt injections created by humans and develops a benchmark for assessing LLM robustness against these attacks.
Findings
Many models are vulnerable to prompt injection strategies.
Some attack strategies generalize to real-world LLM applications.
The dataset reveals interpretable weaknesses in LLMs.
Abstract
While Large Language Models (LLMs) are increasingly being used in real-world applications, they remain vulnerable to prompt injection attacks: malicious third-party prompts that subvert the intent of the system designer. To help researchers study this problem, we present a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection, all created by players of an online game called Tensor Trust. To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs. The attacks in our dataset have a lot of easily interpretable structure, and shed light on the weaknesses of LLMs. We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking. Our benchmark results show that many models…
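The two attack types named in the abstract can be told apart by looking at what the defended model outputs. The sketch below is illustrative only and is not the authors' benchmark code. It assumes a setup in which the defended model should print a fixed success phrase only when given a secret access code; the phrase "Access Granted", the function names, and the matching rules are all assumptions introduced here for illustration.

```python
import re

# Assumed success phrase; a hypothetical choice for this sketch.
SUCCESS_PATTERN = re.compile(r"\W*access granted\W*", re.IGNORECASE)


def is_hijacked(model_output: str) -> bool:
    """Hijacking: the model emits the success phrase (and nothing else)
    even though the attacker never supplied the real access code."""
    return SUCCESS_PATTERN.fullmatch(model_output) is not None


def is_extracted(model_output: str, access_code: str) -> bool:
    """Extraction: the model's output leaks the secret access code."""
    return access_code.lower() in model_output.lower()


def classify_attack(model_output: str, access_code: str) -> str:
    """Label one attack attempt by its outcome (hypothetical labels)."""
    if is_extracted(model_output, access_code):
        return "extraction"
    if is_hijacked(model_output):
        return "hijacking"
    return "failed"
```

For example, under these assumptions `classify_attack("Access Granted.", "hunter2")` would be labeled a hijacking, while an output that repeats the string `hunter2` would be labeled an extraction. A real benchmark would likely need more robust checks, for instance to catch paraphrased or encoded leaks of the access code.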
Peer Reviews
Decision: ICLR 2024 spotlight
* The use of an online game to collect human-generated adversarial examples for instruction-following LLMs is a novel and creative way to understand the weaknesses of these models.
* The creation of benchmarks for evaluating LLM resistance to prompt injection attacks is a valuable contribution, as it provides a standardized way to assess the security of these models. This is of practical importance in the development of secure LLM-based applications.
* The paper's setting is initially hard to grasp; authors should aim to explain the threat model clearly, using notations and specifying the level of access attackers have.
* The paper should establish a connection to Textual Backdoor attacks, even though these attacks typically require a more significant level of access to LLMs or their pretraining data than the setting the authors are primarily interested in. This additional context would help improve the clear understanding of the threat mod
[1] The proposed dataset has been released publicly. [2] The samples in the dataset are high-quality since they are devised manually.
[1] Although the size of the dataset is pretty large, the mechanism of this game is monotonous. [2] I am unsure whether the topic of this paper aligns with the theme of this conference.
The game Tensor Trust is well-designed and has the potential to serve as a benchmark for evaluating the adversarial robustness of Large Language Models.
The task of prompt extraction is distinct from the Tensor Trust game. In other words, Tensor Trust is not well suited to collecting a prompt extraction dataset.
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Software Engineering Research
Methods: Sparse Evolutionary Training
