Gandalf the Red: Adaptive Security for LLMs

Niklas Pfister; Václav Volhejn; Manuel Knott; Santiago Arias; Julia Bazińska; Mykhailo Bichurin; Alan Commike; Janet Darling; Peter Dienes; Matthew Fiedler; David Haber; Matthias Kraft; Marco Lancini; Max Mathys; Damián Pascual-Ortiz; Jakub Podolak; Adrià Romero-López; Kyriacos Shiarlis; Andreas Signer; Zsolt Terek; Athanasios Theocharis; Daniel Timbrell; Samuel Trautwein; Samuel Watts; Yun-Han Wu; Mateo Rojas-Carulla

arXiv:2501.07927 · cs.LG · August 5, 2025

TL;DR

This paper introduces D-SEC, a threat model for adaptive LLM security that separates attackers from legitimate users, and Gandalf, a gamified red-teaming platform for generating realistic prompt attacks. Together they expose trade-offs between security and usability.

Contribution

It presents D-SEC for modeling adaptive threats, introduces Gandalf, a crowd-sourced red-teaming platform, and releases a dataset of 279k prompt attacks collected with it to improve the evaluation of LLM defenses.

Findings

01. Defenses integrated in the LLM (e.g., system prompts) can degrade usability even when they do not block requests.

02. Adaptive defenses and defense-in-depth improve security without harming utility.

03. Realistic attack data helps evaluate and enhance LLM security strategies.

Abstract

Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose D-SEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and expresses the security-utility trade-off in an optimizable form. We further address the shortcomings in existing evaluations by introducing Gandalf, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attacks. Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade…
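To make the idea of scoring security and utility on separate populations concrete, here is a toy sketch in Python. It is not the paper's D-SEC formulation: the `Session` type, the two metrics, the weighted-sum objective, the `weight` parameter, and the session logs are all hypothetical and synthetic, chosen only to illustrate how attacker sessions and legitimate-user sessions might be evaluated separately over multi-step interactions and then combined into a single quantity that can be optimized over candidate defenses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Session:
    """One multi-step interaction (hypothetical structure, for illustration)."""
    is_attacker: bool
    # Attacker session: True on a turn means the attack succeeded on that turn.
    # Legitimate session: True on a turn means the reply was useful.
    turn_outcomes: List[bool]


def attacker_failure_rate(sessions: List[Session]) -> float:
    """Fraction of attacker sessions in which no turn succeeded."""
    attacks = [s for s in sessions if s.is_attacker]
    if not attacks:
        return 1.0
    return sum(not any(s.turn_outcomes) for s in attacks) / len(attacks)


def user_utility(sessions: List[Session]) -> float:
    """Mean fraction of useful turns across legitimate-user sessions."""
    benign = [s for s in sessions if not s.is_attacker and s.turn_outcomes]
    if not benign:
        return 1.0
    per_session = [sum(s.turn_outcomes) / len(s.turn_outcomes) for s in benign]
    return sum(per_session) / len(per_session)


def tradeoff_score(sessions: List[Session], weight: float = 0.5) -> float:
    """Assumed objective: weighted sum of security and utility."""
    return (weight * attacker_failure_rate(sessions)
            + (1 - weight) * user_utility(sessions))


# Synthetic logs for three hypothetical defense configurations.
LOGS: Dict[str, List[Session]] = {
    "strict": [
        Session(True, [False, False, False]),
        Session(True, [False, False]),
        Session(False, [True, False]),
        Session(False, [False, True]),
    ],
    "lenient": [
        Session(True, [False, True]),
        Session(True, [False, False]),
        Session(False, [True, True]),
        Session(False, [True, True]),
    ],
    "balanced": [
        Session(True, [False, False]),
        Session(True, [False, False, False]),
        Session(False, [True, True]),
        Session(False, [True, False]),
    ],
}

# Pick the defense with the highest combined score.
best_defense = max(LOGS, key=lambda name: tradeoff_score(LOGS[name]))
```

In this sketch an attacker session counts as a breach if any turn succeeds, which reflects the multi-step framing: an adaptive attacker only needs to win once. Changing `weight` shifts the preference between blocking attackers and serving legitimate users.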

Code & Models

Repositories

lakeraai/dsec-gandalf (official)

Videos

Gandalf the Red: Adaptive Security for LLMs · SlidesLive

Taxonomy

Topics: Digital Rights Management and Security