SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models
Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A. Hale, Paul Röttger

TL;DR
This paper introduces SimpleSafetyTests, a test suite of 100 prompts across five harm areas for rapidly and systematically identifying critical safety risks in large language models. Testing reveals critical safety weaknesses in most of the evaluated models.
Contribution
The paper presents a new test suite, SimpleSafetyTests, for evaluating safety risks in LLMs and assesses the effectiveness of safety filters and mitigation strategies.
Findings
Most models respond unsafely to over 20% of prompts
Prepending safety prompts reduces unsafe responses but doesn't eliminate them
Safety filter performance varies significantly across models and harm areas
Abstract
The past year has seen rapid acceleration in the development of large language models (LLMs). However, without proper steering and safeguards, LLMs will readily follow malicious instructions, provide unsafe advice, and generate toxic content. We introduce SimpleSafetyTests (SST) as a new test suite for rapidly and systematically identifying such critical safety risks. The test suite comprises 100 test prompts across five harm areas that LLMs, for the vast majority of applications, should refuse to comply with. We test 11 open-access and open-source LLMs and four closed-source LLMs, and find critical safety weaknesses. While some of the models do not give a single unsafe response, most give unsafe responses to more than 20% of the prompts, with over 50% unsafe responses in the extreme. Prepending a safety-emphasising system prompt substantially reduces the occurrence of unsafe responses,…
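The headline figures in the abstract are per-model shares of unsafe responses. As a minimal sketch of how such a share could be computed from labelled responses, assuming each response has already been marked safe or unsafe: the function name `unsafe_rates`, the tuple layout, and the toy labels below are hypothetical and are not taken from the paper or its data.

```python
from collections import defaultdict


def unsafe_rates(labels):
    """Return the fraction of unsafe responses per model.

    `labels` is an iterable of (model, harm_area, is_unsafe) tuples.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for model, _harm_area, is_unsafe in labels:
        totals[model] += 1
        unsafe[model] += int(is_unsafe)
    return {model: unsafe[model] / totals[model] for model in totals}


# Made-up labels for two placeholder models; these are NOT results from the paper.
toy_labels = [
    ("model_a", "area_1", False),
    ("model_a", "area_1", True),
    ("model_a", "area_2", False),
    ("model_a", "area_2", False),
    ("model_b", "area_1", False),
    ("model_b", "area_2", False),
]

rates = unsafe_rates(toy_labels)

# Flag models whose unsafe share exceeds the 20% level mentioned in the abstract.
flagged = [model for model, rate in rates.items() if rate > 0.20]
```

Grouping by `(model, harm_area)` instead of `model` alone would give the per-harm-area breakdown in the same way.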
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Adam · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection · Dropout
