STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions
Robert Morabito, Sangmitra Madhusudan, Tyler McDonald, Ali Emami

TL;DR
This paper introduces the STOP dataset for benchmarking how sensitive large language models are to offensive progressions. It finds that models detect bias inconsistently and that aligning them with human judgments improves their answer rates.
Contribution
The paper presents STOP, a new dataset of offensive progressions spanning diverse demographics, and evaluates multiple closed- and open-source models on it. The evaluation highlights inconsistent bias detection and room for improvement through alignment with human judgments.
Findings
Models detect bias with success rates ranging from 19.3% to 69.8%.
Aligning models with human judgments improves answer rates by up to 191%.
STOP enables comprehensive bias assessment across 9 demographics and 46 sub-demographics.
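A relative figure such as "up to 191%" is easy to misread, so the following minimal sketch shows the arithmetic. The baseline value is a made-up illustration, not a number from the paper, and the function names are hypothetical.

```python
def relative_gain_percent(before: float, after: float) -> float:
    """Percentage change going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before * 100.0


def apply_gain(before: float, gain_percent: float) -> float:
    """Value reached after a relative gain of `gain_percent` percent."""
    return before * (1.0 + gain_percent / 100.0)


# Assumption: a hypothetical baseline answer rate of 20%.
# This number is illustrative only and does not come from the paper.
baseline = 0.20

# A 191% relative improvement multiplies the baseline by 2.91.
improved = apply_gain(baseline, 191.0)
recovered_gain = relative_gain_percent(baseline, improved)
```

Under this assumed baseline, a 191% improvement means the answer rate nearly triples rather than rising by 191 percentage points.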
Abstract
Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing. However, many current methodologies evaluate scenarios in isolation, without considering the broader context or the spectrum of potential biases within each situation. To address this, we introduce the Sensitivity Testing on Offensive Progressions (STOP) dataset, which includes 450 offensive progressions containing 2,700 unique sentences of varying severity that progressively escalate from less to more explicitly offensive. Covering a broad spectrum of 9 demographics and 46 sub-demographics, STOP ensures inclusivity and comprehensive coverage. We evaluate several leading closed- and open-source models, including GPT-4, Mixtral, and Llama 3. Our findings reveal that even the best-performing models detect bias inconsistently, with success rates…
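To make the dataset figures concrete, here is a minimal sketch of how a progression and a success rate could be represented. The `Progression` fields, the function names, and the per-progression boolean outcome are all assumptions made for illustration. This summary does not give the dataset schema or the paper's exact scoring rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Progression:
    """Hypothetical container for one offensive progression."""
    demographic: str        # one of the 9 demographics
    sub_demographic: str    # one of the 46 sub-demographics
    sentences: List[str]    # ordered from less to more explicitly offensive


def sentences_per_progression(total_sentences: int, total_progressions: int) -> float:
    """Average number of sentences per progression."""
    return total_sentences / total_progressions


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of progressions counted as a successful detection.

    Assumption: each progression yields a single pass/fail outcome;
    the real criterion is defined in the paper, not here.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


# Figures from the abstract: 2,700 sentences across 450 progressions.
average_length = sentences_per_progression(2700, 450)
```

The abstract's counts work out to an average of six sentences per progression. Whether every progression has exactly that length is not stated here.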
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · LLaMA · Softmax · Layer Normalization · Dropout
