Decoding Hate: Exploring Language Models' Reactions to Hate Speech

Paloma Piot; Javier Parapar

arXiv:2410.00775 · cs.CL · June 10, 2025

TL;DR

This study examines how seven advanced language models respond to hate speech, analyzing their reactions and discussing mitigation strategies to prevent hate speech generation in AI systems.

Contribution

It provides a comparative analysis of LLMs' responses to hate speech and explores effective mitigation techniques like fine-tuning and guideline guardrailing.

Findings

01

Models show varied responses to hate speech inputs.

02

Fine-tuning and guardrailing can reduce hate speech generation.

03

Models respond differently to politically correct framing.

Abstract

Hate speech is a harmful form of online expression, often manifesting as derogatory posts, and it poses a significant risk in digital environments. With the rise of Large Language Models (LLMs), there is concern that they may replicate hate speech patterns, given that they are trained on vast amounts of unmoderated internet data. Understanding how LLMs respond to hate speech is crucial for their responsible deployment. However, the behaviour of LLMs towards hate speech has received comparatively little study. This paper investigates the reactions of seven state-of-the-art LLMs (LLaMA 2, Vicuna, LLaMA 3, Mistral, GPT-3.5, GPT-4, and Gemini Pro) to hate speech. Through qualitative analysis, we aim to reveal the spectrum of responses these models produce, highlighting their capacity to handle hate speech inputs. We also discuss strategies to mitigate hate speech generation by LLMs, particularly through…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection

Methods: Cosine Annealing · Layer Normalization · Attention Is All You Need · Linear Warmup With Cosine Annealing · Adam · Linear Layer · Residual Connection · Weight Decay