SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal

Tinghao Xie; Xiangyu Qi; Yi Zeng; Yangsibo Huang; Udari Madhushani Sehwag; Kaixuan Huang; Luxi He; Boyi Wei; Dacheng Li; Ying Sheng; Ruoxi Jia; Bo Li; Kai Li; Danqi Chen; Peter Henderson; Prateek Mittal

arXiv:2406.14598·cs.AI·March 4, 2025·5 cites

TL;DR

SORRY-Bench introduces a comprehensive, fine-grained, and linguistically diverse benchmark for evaluating large language models' safety refusal capabilities, addressing limitations of previous coarse evaluations and computational costs.

Contribution

It provides a detailed safety evaluation framework with a fine-grained unsafe topic taxonomy, linguistic augmentations, and an efficient automated evaluation method using fine-tuned small LLMs.

Findings

01

Fine-tuned 7B LLMs achieve GPT-4 level accuracy in safety refusal evaluation.

02

Over 50 LLMs were systematically analyzed using SORRY-Bench.

03

The benchmark reveals diverse safety refusal behaviors across models.

Abstract

Evaluating aligned large language models' (LLMs) ability to recognize and reject unsafe user requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts, however, face three limitations that we address with SORRY-Bench, our proposed benchmark. First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics. For example, among the ten existing datasets that we evaluated, tests for refusals of self-harm instructions are over 3x less represented than tests for fraudulent activities. SORRY-Bench improves on this by using a fine-grained taxonomy of 44 potentially unsafe topics and 440 class-balanced unsafe instructions, compiled through human-in-the-loop methods. Second, linguistic characteristics and formatting of prompts, such as different languages, dialects, and more, are often overlooked -- which…
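
The class-balance property mentioned in the abstract (44 topics, 440 instructions, so 10 per topic) can be illustrated with a minimal sketch. The record layout, the placeholder topic names, and the helper functions `category_counts` and `imbalance_ratio` are assumptions made for illustration; they are not taken from the SORRY-Bench release.

```python
from collections import Counter


def category_counts(records):
    """Count instructions per topic; `records` is an iterable of (topic, instruction) pairs."""
    return Counter(topic for topic, _ in records)


def imbalance_ratio(counts):
    """Ratio of the most- to the least-represented topic (1.0 means perfectly balanced)."""
    return max(counts.values()) / min(counts.values())


# Hypothetical toy data: 44 placeholder topics with 10 placeholder instructions each.
balanced = [
    (f"topic_{t}", f"instruction_{t}_{i}")
    for t in range(44)
    for i in range(10)
]

counts = category_counts(balanced)
print(len(balanced), len(counts), imbalance_ratio(counts))  # prints: 440 44 1.0
```

Applied to a dataset in which one topic has three times as many test cases as another, `imbalance_ratio` would return 3.0, which is the kind of skew the abstract reports for existing datasets.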

Code & Models

Repositories

princeton-nlp/unintentional-unalignment · pytorch

Videos

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal · slideslive

Taxonomy

Topics: Software Reliability and Analysis Research · Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer