Can LLMs Recognize Toxicity? A Structured Investigation Framework and Toxicity Metric
Hyukhun Koh, Dohyung Kim, Minwoo Lee, and Kyomin Jung

TL;DR
This paper proposes a new, robust toxicity metric based on Large Language Models that addresses limitations of existing methods, providing more accurate toxicity detection aligned with societal standards.
Contribution
It introduces a flexible toxicity measurement framework using LLMs, analyzes toxicity factors, and evaluates the intrinsic attributes of LLMs as evaluators, improving accuracy over traditional metrics.
Findings
Within verified toxicity factors, the new metric outperforms conventional metrics by 12 F1 points.
Upstream toxicity factors heavily influence downstream evaluation metrics.
LLMs are unsuitable as toxicity evaluators for unverified factors.
Abstract
In the pursuit of developing Large Language Models (LLMs) that adhere to societal standards, it is imperative to detect the toxicity in the generated text. The majority of existing toxicity metrics rely on encoder models trained on specific toxicity datasets, which are susceptible to out-of-distribution (OOD) problems and depend on the dataset's definition of toxicity. In this paper, we introduce a robust metric grounded on LLMs to flexibly measure toxicity according to the given definition. We first analyze the toxicity factors, followed by an examination of the intrinsic toxic attributes of LLMs to ascertain their suitability as evaluators. Finally, we evaluate the performance of our metric with detailed analysis. Our empirical results demonstrate outstanding performance in measuring toxicity within verified factors, improving on conventional metrics by 12 points in the F1 score. Our…
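The abstract's core idea, measuring toxicity against a caller-supplied definition and scoring the result with F1, can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the function names (`build_prompt`, `llm_toxicity_labels`, `f1_score`), and the `keyword_stub` judge that stands in for a real LLM call are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def build_prompt(definition: str, text: str) -> str:
    """Compose a yes/no question conditioned on the given toxicity definition.

    The wording is hypothetical; the paper's actual prompt is not shown here.
    """
    return (
        f"Definition of toxicity: {definition}\n"
        f"Text: {text}\n"
        "Is the text toxic under this definition? Answer yes or no."
    )


def llm_toxicity_labels(
    texts: Sequence[str],
    definition: str,
    ask_llm: Callable[[str], str],
) -> List[bool]:
    """Label each text as toxic (True) or not (False) using any prompt -> answer callable."""
    labels = []
    for text in texts:
        answer = ask_llm(build_prompt(definition, text))
        labels.append(answer.strip().lower().startswith("yes"))
    return labels


def f1_score(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Binary F1 for the toxic class: harmonic mean of precision and recall."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def keyword_stub(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: flags the text line if it contains 'idiot'."""
    text_line = next(
        line for line in prompt.splitlines() if line.startswith("Text: ")
    )
    return "yes" if "idiot" in text_line.lower() else "no"
```

Because the judge is passed in as a plain callable, swapping `keyword_stub` for a real model client changes nothing else in the sketch, and changing the `definition` string is the only step needed to evaluate against a different notion of toxicity.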
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Statistical and Computational Modeling
