XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Paul R\"ottger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio,, Federico Bianchi, Dirk Hovy

TL;DR
XSTest is a comprehensive test suite designed to identify exaggerated safety behaviors in large language models, revealing systematic failures in model safety and helpfulness balance.
Contribution
The paper introduces XSTest, a novel test suite with 450 prompts to systematically evaluate safety behaviors in large language models.
Findings
Models often refuse safe prompts due to safety over-correction.
XSTest uncovers systematic safety failure modes.
Challenges in balancing helpfulness and harmlessness are highlighted.
Abstract
Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
- 🤗meta-llama/Llama-Guard-3-8Bmodel· 83k dl· ♡ 28383k dl♡ 283
- 🤗meta-llama/Llama-Guard-3-8B-INT8model· 8.6k dl· ♡ 388.6k dl♡ 38
- 🤗QuantFactory/Llama-Guard-3-8B-GGUFmodel· 363 dl· ♡ 2363 dl♡ 2
- 🤗Najii/Llama-Guardmodel
- 🤗Najii/Llama-Guard-3-8B-INT8model
- 🤗meta-llama/Llama-Guard-3-1Bmodel· 63k dl· ♡ 10363k dl♡ 103
- 🤗meta-llama/Llama-Guard-3-1B-INT4model· 30 dl· ♡ 2730 dl♡ 27
- 🤗QuantFactory/Llama-Guard-3-1B-GGUFmodel· 462 dl· ♡ 7462 dl♡ 7
- 🤗alpindale/Llama-Guard-3-1Bmodel· 454 dl· ♡ 2454 dl♡ 2
- 🤗alpindale/Llama-Guard-3-1B-INT4model· 9 dl9 dl
Videos
Taxonomy
TopicsSoftware Engineering Research · Software Reliability and Analysis Research · Adversarial Robustness in Machine Learning
