STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions

Robert Morabito; Sangmitra Madhusudan; Tyler McDonald; Ali Emami

arXiv:2409.13843 · cs.CL · February 4, 2025

TL;DR

This paper introduces the STOP dataset for benchmarking how sensitive large language models are to offensive progressions. It finds that models detect bias inconsistently and shows that aligning them with human judgments improves their answer rates.

Contribution

The paper presents the novel STOP dataset with diverse offensive progressions and evaluates multiple models, highlighting their inconsistent bias detection and potential for improvement through alignment.

Findings

01. Models detect bias with success rates from 19.3% to 69.8%.

02. Aligning models with human judgments improves answer rates by up to 191%.

03. STOP enables comprehensive bias assessment across diverse demographics.

Abstract

Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing. However, many current methodologies evaluate scenarios in isolation, without considering the broader context or the spectrum of potential biases within each situation. To address this, we introduce the Sensitivity Testing on Offensive Progressions (STOP) dataset, which includes 450 offensive progressions containing 2,700 unique sentences of varying severity that progressively escalate from less to more explicitly offensive. Covering a broad spectrum of 9 demographics and 46 sub-demographics, STOP ensures inclusivity and comprehensive coverage. We evaluate several leading closed- and open-source models, including GPT-4, Mixtral, and Llama 3. Our findings reveal that even the best-performing models detect bias inconsistently, with success rates…
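
As a quick check on the figures above, 2,700 sentences spread over 450 progressions works out to 6 sentences per progression. The Python sketch below shows one way a progression and a model's per-sentence judgments might be represented. The `Progression` class, its field names, and the helpers `flag_rate` and `first_flag_index` are hypothetical and are not taken from the paper or its repository; the scoring shown is an illustrative assumption, not the paper's metric.

```python
from dataclasses import dataclass
from typing import List, Optional

# Figures stated in the abstract.
NUM_PROGRESSIONS = 450
NUM_SENTENCES = 2700
SENTENCES_PER_PROGRESSION = NUM_SENTENCES // NUM_PROGRESSIONS


@dataclass
class Progression:
    """Hypothetical container for one offensive progression."""
    demographic: str
    sub_demographic: str
    sentences: List[str]  # assumed ordered from least to most offensive


def flag_rate(flags: List[bool]) -> float:
    """Fraction of sentences in one progression that a model flagged as biased."""
    if not flags:
        raise ValueError("at least one judgment is required")
    return sum(flags) / len(flags)


def first_flag_index(flags: List[bool]) -> Optional[int]:
    """Position of the first flagged sentence, or None if nothing was flagged."""
    for index, flagged in enumerate(flags):
        if flagged:
            return index
    return None


# Placeholder judgments: a model that only reacts from the third sentence on.
example_flags = [False, False, True, True, True, True]
rate = flag_rate(example_flags)
onset = first_flag_index(example_flags)
```

Tracking where the first flag appears, rather than only the overall rate, is one way to see how far a progression must escalate before a model reacts.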

Code & Models

Repositories

Robert-Morabito/STOP (Official)

Videos

STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions · underline

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · LLaMA · Softmax · Layer Normalization · Dropout