Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
Regina Gugg, Selina Niederländer, Andreas Stöckl, Martin Flechl

TL;DR
This paper critically examines toxicity benchmarks for large language models, revealing significant biases and inconsistencies that impact the reliability of safety evaluations.
Contribution
It systematically investigates intrinsic biases in toxicity benchmarks, shows how they affect the robustness of evaluations, and argues that improved safety assessment methods are needed.
Findings
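Benchmark behavior changes significantly when the task changes: moving from text completion to summarization increases the tendency of benchmarks to flag content as harmful.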
Input data domain shifts affect benchmark consistency.
Model-specific instabilities highlight evaluation vulnerabilities.
Abstract
The rapid adoption of LLMs in both research and industry highlights the challenges of deploying them safely and reveals a gap in the systematic evaluation of toxicity benchmarks. As organizations increasingly rely on these benchmarks to certify models for customer-facing applications and automated moderation, unrecognized evaluation biases could lead to the deployment of vulnerable or unsafe systems. This work investigates the robustness of established benchmarking setups and examines how to measure currently neglected intrinsic biases, such as those related to model choice, metrics, and task types. Our experiments uncover significant discrepancies in benchmark behaviors when evaluation setups are altered. Specifically, shifting the task from text completion to summarization increases the tendency of benchmarks to flag content as harmful. Additionally, certain benchmarks fail to…
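As a rough illustration of the kind of comparison the abstract describes, the sketch below computes how often outputs are flagged as harmful under each task type and reports the gap between them. It is not the authors' pipeline: the function name `flag_rate_by_task`, the 0.5 threshold, and the scores are all hypothetical placeholders, and in practice the scores would come from whichever toxicity benchmark is under study.

```python
from collections import defaultdict


def flag_rate_by_task(records, threshold=0.5):
    """Return the fraction of outputs flagged as harmful for each task.

    records: iterable of (task_name, toxicity_score) pairs.
    threshold: hypothetical cut-off at or above which an output counts as flagged.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for task, score in records:
        total[task] += 1
        if score >= threshold:
            flagged[task] += 1
    return {task: flagged[task] / total[task] for task in total}


# Made-up scores for illustration only; these are NOT results from the paper.
records = [
    ("completion", 0.2), ("completion", 0.7),
    ("completion", 0.1), ("completion", 0.4),
    ("summarization", 0.6), ("summarization", 0.8),
    ("summarization", 0.3), ("summarization", 0.9),
]

rates = flag_rate_by_task(records)
# Difference in flag rate when only the task type changes.
gap = rates["summarization"] - rates["completion"]
print(rates, gap)
```

Holding the model and input data fixed while varying only the task is what makes a gap like this attributable to the evaluation setup rather than to the content being scored.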
