Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts

Ashutosh Baheti; Maarten Sap; Alan Ritter; Mark Riedl

arXiv:2108.11830 · cs.CL · September 14, 2021

TL;DR

This paper investigates how neural dialogue models respond in offensive contexts, including cases where a model implicitly endorses an offensive statement by agreeing with it. It introduces a dataset for this analysis and evaluates methods for reducing offensive output, with the aim of improving dialogue safety.

Contribution

The paper presents ToxiChat, a crowd-annotated dataset of 2,000 Reddit threads and model responses labeled for offensive language and stance, and evaluates controllable text generation techniques for mitigating offensive responses from neural dialogue models.

Findings

1. 42% of human responses agree with toxic comments, whereas only 13% agree with safe comments.
2. Neural dialogue models such as DialoGPT are twice as likely to agree with offensive comments.
3. Fine-tuned classifiers achieve 0.71 F1 on offensive language detection.

Abstract

Dialogue models trained on human conversations inadvertently learn to generate toxic responses. In addition to producing explicitly offensive utterances, these models can also implicitly insult a group or individual by aligning themselves with an offensive statement. To better understand the dynamics of contextually offensive language, we investigate the stance of dialogue model responses in offensive Reddit conversations. Specifically, we create ToxiChat, a crowd-annotated dataset of 2,000 Reddit threads and model responses labeled with offensive language and stance. Our analysis reveals that 42% of human responses agree with toxic comments, whereas only 13% agree with safe comments. This undesirable behavior is learned by neural dialogue models, such as DialoGPT, which we show are two times more likely to agree with offensive comments. To enable automatic detection of offensive…

Code & Models

Repositories

abaheti95/toxichat (PyTorch, official)

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Advanced Malware Detection Techniques