Evaluating GPT-3 Generated Explanations for Hateful Content Moderation

Han Wang; Ming Shan Hee; Md Rabiul Awal; Kenny Tsu Wei Choo; Roy Ka-Wei Lee

arXiv:2305.17680·cs.CL·August 31, 2023·2 cites

TL;DR

This study evaluates GPT-3-generated explanations for hate speech. It finds that the explanations are of high linguistic quality but can mislead human judgments, and it urges caution when using them in content moderation.

Contribution

The paper introduces an analytical framework and an extensive survey to assess GPT-3 explanations for hate speech, highlighting their strengths and limitations.

Findings

01

GPT-3 explanations are linguistically fluent and informative.

02

Persuasiveness varies with prompting strategy.

03

Explanations can mislead judgments about whether content is hateful.

Abstract

Recent research has focused on using large language models (LLMs) to generate explanations for hate speech through fine-tuning or prompting. Despite the growing interest in this area, these generated explanations' effectiveness and potential limitations remain poorly understood. A key concern is that these explanations, generated by LLMs, may lead to erroneous judgments about the nature of flagged content by both users and content moderators. For instance, an LLM-generated explanation might inaccurately convince a content moderator that a benign piece of content is hateful. In light of this, we propose an analytical framework for examining hate speech explanations and conducted an extensive survey on evaluating such explanations. Specifically, we prompted GPT-3 to generate explanations for both hateful and non-hateful content, and a survey was conducted with 2,400 unique respondents to…

Code & Models

Repositories

social-ai-studio/gpt3-hateeval (Official)

Taxonomy

Topics: Hate Speech and Cyberbullying Detection

Methods: Multi-Head Attention · Attention Is All You Need · Adam · Dense Connections · Weight Decay · Cosine Annealing · Attention Dropout