SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal

TL;DR
SORRY-Bench is a fine-grained, class-balanced, and linguistically diverse benchmark for evaluating large language models' safety refusal. It addresses two weaknesses of earlier evaluations: coarse, unevenly represented taxonomies of unsafe topics and the computational cost of automated judging.
Contribution
It provides a detailed safety evaluation framework with a fine-grained unsafe topic taxonomy, linguistic augmentations, and an efficient automated evaluation method using fine-tuned small LLMs.
Findings
Fine-tuned 7B LLMs achieve GPT-4-level accuracy in safety refusal evaluation.
Over 50 LLMs were systematically analyzed using SORRY-Bench.
The benchmark reveals diverse safety refusal behaviors across models.
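Comparing refusal behavior across models comes down to aggregating per-response judgments. The following is a minimal sketch, not the authors' pipeline: it assumes each response has already been given a binary label by a judge (1 = the model fulfilled the unsafe request, 0 = it refused), and the function name `fulfillment_rates` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def fulfillment_rates(judgments):
    """Aggregate binary judge labels into per-model, per-category rates.

    judgments: iterable of (model, category, label) tuples, where label is
    1 if the response fulfilled the unsafe request and 0 if it refused.
    The tuple layout is an assumption made for this sketch.
    """
    totals = defaultdict(lambda: defaultdict(int))
    fulfilled = defaultdict(lambda: defaultdict(int))
    for model, category, label in judgments:
        totals[model][category] += 1
        fulfilled[model][category] += int(label)
    # Convert counts to rates; every (model, category) seen has total >= 1.
    return {
        model: {
            category: fulfilled[model][category] / count
            for category, count in categories.items()
        }
        for model, categories in totals.items()
    }
```

A lower rate means the model refuses more often in that category; comparing the nested dictionaries across models gives the kind of per-topic contrast the findings describe.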
Abstract
Evaluating aligned large language models' (LLMs) ability to recognize and reject unsafe user requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts, however, face three limitations that we address with SORRY-Bench, our proposed benchmark. First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics. For example, among the ten existing datasets that we evaluated, tests for refusals of self-harm instructions are over 3x less represented than tests for fraudulent activities. SORRY-Bench improves on this by using a fine-grained taxonomy of 44 potentially unsafe topics and 440 class-balanced unsafe instructions, compiled through human-in-the-loop methods. Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which…
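The class-balance claim can be checked with simple counting: 440 instructions over 44 topics works out to 10 per topic. The sketch below is illustrative only; the topic names and the skewed example counts are made up, and `imbalance_ratio` is a hypothetical helper rather than part of any SORRY-Bench tooling.

```python
from collections import Counter


def imbalance_ratio(categories):
    """Return the ratio of the largest to the smallest category count.

    A ratio of 1.0 means every category has the same number of items.
    """
    counts = Counter(categories)
    return max(counts.values()) / min(counts.values())


# A balanced layout matching the stated totals: 44 topics x 10 items = 440.
# The topic labels are placeholders, not the benchmark's real category names.
balanced = [f"topic_{i}" for i in range(44) for _ in range(10)]

# A made-up skewed dataset in which one topic is more than 3x
# better represented than another.
skewed = ["fraud"] * 31 + ["self_harm"] * 10
```

Here `imbalance_ratio(balanced)` is exactly 1.0, while the skewed list gives a ratio above 3, the kind of gap the abstract reports between fraud and self-harm tests in existing datasets.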
Code & Models
- 🤗 sorry-bench/ft-mistral-7b-instruct-v0.2-sorry-bench-202406 · model · 5.0k dl · ♡ 7
- 🤗 RichardErkhov/sorry-bench_-_ft-mistral-7b-instruct-v0.2-sorry-bench-202406-gguf · model · 12 dl
- 🤗 SentientAGI/Dobby-Mini-Leashed-Llama-3.1-8B · model · 24 dl · ♡ 12
- 🤗 SentientAGI/Dobby-Mini-Unhinged-Llama-3.1-8B · model · 26 dl · ♡ 47
- 🤗 SentientAGI/Dobby-Unhinged-Llama-3.3-70B · model · 11 dl · ♡ 45
- 🤗 SentientAGI/Dobby-Mini-Unhinged-Plus-Llama-3.1-8B · model · 20 dl · ♡ 18
Taxonomy
Topics: Software Reliability and Analysis Research · Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer
