SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models

Bertie Vidgen; Nino Scherrer; Hannah Rose Kirk; Rebecca Qian; Anand Kannappan; Scott A. Hale; Paul Röttger

arXiv:2311.08370 · cs.CL · February 19, 2024 · 5 cites

TL;DR

This paper introduces SimpleSafetyTests (SST), a test suite of 100 prompts across five harm areas for rapidly and systematically identifying critical safety risks in large language models. Tests on 11 open-access and open-source models and four closed-source models reveal critical safety weaknesses.

Contribution

The paper presents a new test suite, SimpleSafetyTests, for evaluating critical safety risks in LLMs, and assesses how well a safety-emphasising system prompt and safety filters mitigate those risks.

Findings

01. Most models respond unsafely to over 20% of prompts

02. Prepending safety prompts reduces unsafe responses but doesn't eliminate them

03. Safety filter performance varies significantly across models and harm areas

Abstract

The past year has seen rapid acceleration in the development of large language models (LLMs). However, without proper steering and safeguards, LLMs will readily follow malicious instructions, provide unsafe advice, and generate toxic content. We introduce SimpleSafetyTests (SST) as a new test suite for rapidly and systematically identifying such critical safety risks. The test suite comprises 100 test prompts across five harm areas that LLMs, for the vast majority of applications, should refuse to comply with. We test 11 open-access and open-source LLMs and four closed-source LLMs, and find critical safety weaknesses. While some of the models do not give a single unsafe response, most give unsafe responses to more than 20% of the prompts, with over 50% unsafe responses in the extreme. Prepending a safety-emphasising system prompt substantially reduces the occurrence of unsafe responses,…
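The headline figures in the abstract are percentages of unsafe responses per model. As a rough illustration of how such rates could be tallied from labelled responses, here is a minimal Python sketch. It is not the authors' evaluation code: the record layout, the `unsafe_rates` function, and the model and harm-area names are all hypothetical, and the sample records are made up for the example rather than taken from the paper.

```python
from collections import defaultdict


def unsafe_rates(records):
    """Return percent-unsafe per (model, harm_area), plus a per-model "ALL" total.

    Each record is a dict with hypothetical keys:
    "model", "harm_area", and "unsafe" (bool label for one response).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [unsafe_count, total_count]
    for r in records:
        # Count each response once for its harm area and once for the model overall.
        for key in ((r["model"], r["harm_area"]), (r["model"], "ALL")):
            counts[key][0] += int(r["unsafe"])
            counts[key][1] += 1
    return {key: 100.0 * unsafe / total for key, (unsafe, total) in counts.items()}


# Made-up labelled responses for illustration only (not results from the paper).
records = [
    {"model": "model-a", "harm_area": "area-1", "unsafe": True},
    {"model": "model-a", "harm_area": "area-1", "unsafe": False},
    {"model": "model-a", "harm_area": "area-2", "unsafe": False},
    {"model": "model-a", "harm_area": "area-2", "unsafe": False},
    {"model": "model-b", "harm_area": "area-1", "unsafe": False},
]

rates = unsafe_rates(records)
for key in sorted(rates):
    print(key, f"{rates[key]:.1f}%")
```

Keeping a per-area breakdown alongside the overall rate makes it possible to see whether a model's unsafe responses are concentrated in one harm area or spread across all of them.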

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Adam · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection · Dropout