Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs

Giulio Zizzo; Giandomenico Cornacchia; Kieran Fraser; Muhammad Zaid Hameed; Ambrish Rawat; Beat Buesser; Mark Purcell; Pin-Yu Chen; Prasanna Sattigeri; Kush Varshney

arXiv:2502.15427·cs.CR·February 24, 2025

TL;DR

This paper systematically benchmarks 15 different guardrail defenses against various jailbreak prompts on large language models, revealing significant performance variation and the effectiveness of simple baselines in out-of-distribution scenarios.

Contribution

It provides a comprehensive evaluation framework for LLM safety defenses, highlighting their strengths and weaknesses across diverse attack styles and datasets.

Findings

01 Performance varies significantly by jailbreak style.

02 Simple baselines can be competitive with advanced defenses.

03 Current datasets may not fully capture out-of-distribution attack robustness.

Abstract

As large language models (LLMs) become integrated into everyday applications, ensuring their robustness and security is increasingly critical. In particular, LLMs can be manipulated into unsafe behaviour by prompts known as jailbreaks. The variety of jailbreak styles is growing, necessitating the use of external defences known as guardrails. While many jailbreak defences have been proposed, not all defences are able to handle new out-of-distribution attacks due to the narrow segment of jailbreaks used to align them. Moreover, the lack of systematisation around defences has created significant gaps in their practical application. In this work, we perform systematic benchmarking across 15 different defences, considering a broad swathe of malicious and benign datasets. We find that there is significant performance variation depending on the style of jailbreak a defence is subject to.…

Code & Models

Repositories

ibm/adversarial-prompt-evaluation
PyTorch · Official

Taxonomy

Methods: ALIGN