XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

Paul Röttger; Hannah Rose Kirk; Bertie Vidgen; Giuseppe Attanasio; Federico Bianchi; Dirk Hovy

arXiv:2308.01263 · cs.CL · April 2, 2024 · 6 cites

TL;DR

XSTest is a test suite for identifying exaggerated safety behaviours in large language models, that is, refusals of clearly safe prompts. It reveals systematic failures in how models balance safety and helpfulness.

Contribution

The paper introduces XSTest, a test suite of 450 prompts (250 safe and 200 unsafe) for systematically evaluating safety behaviours in large language models.
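As a rough illustration of how a suite of labelled safe and unsafe prompts could be scored, the sketch below tallies a refusal rate per label. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name `refusal_rates`, the `(label, refused)` record format, and the toy data are all hypothetical, and deciding whether a given response counts as a refusal is left outside the sketch.

```python
from collections import Counter


def refusal_rates(records):
    """Return the fraction of refused prompts for each label.

    `records` is an iterable of (label, refused) pairs, where `label` is a
    string such as "safe" or "unsafe" and `refused` is a bool. This record
    format is an assumption made for illustration.
    """
    totals = Counter()
    refusals = Counter()
    for label, refused in records:
        totals[label] += 1
        if refused:
            refusals[label] += 1
    return {label: refusals[label] / totals[label] for label in totals}


# Toy, made-up records for illustration only (not results from the paper).
toy = [
    ("safe", True), ("safe", False), ("safe", False), ("safe", False),
    ("unsafe", True), ("unsafe", True),
]
rates = refusal_rates(toy)
```

Under this reading, a high refusal rate on safe prompts would point to exaggerated safety, while a low refusal rate on unsafe prompts would point to insufficient safeguards.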

Findings

01. Models often refuse safe prompts because of safety over-correction.

02. XSTest uncovers systematic safety failure modes.

03. The results highlight the challenge of balancing helpfulness and harmlessness.

Abstract

Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as…

Code & Models

Repositories

paul-rottger/exaggerated-safety (Official)

Videos

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

Taxonomy

Topics: Software Engineering Research · Software Reliability and Analysis Research · Adversarial Robustness in Machine Learning