Improving Generalizability in Implicitly Abusive Language Detection with Concept Activation Vectors

Isar Nejadgholi; Kathleen C. Fraser; Svetlana Kiritchenko

arXiv:2204.02261 · cs.CL · April 6, 2022 · 1 citation

TL;DR

This paper uses Concept Activation Vectors to quantify how sensitive an abusive language classifier is to explicit and implicit abuse, and introduces a new metric, Degree of Explicitness, for selecting informative implicit-abuse examples to add to the training data, which improves generalizability.

Contribution

It applies TCAV-based interpretability to abusive language detection and proposes the Degree of Explicitness metric to improve model robustness against implicit abuse.

Findings

01. General abusive language classifiers are fairly reliable at detecting explicit abuse out of domain.

02. The same classifiers fail to detect new types of subtler, implicit abuse.

03. The Degree of Explicitness metric helps select informative training examples.

Abstract

Robustness of machine learning models on ever-changing real-world data is critical, especially for applications affecting human well-being such as content moderation. New kinds of abusive language continually emerge in online discussions in response to current events (e.g., COVID-19), and the deployed abuse detection systems should be updated regularly to remain accurate. In this paper, we show that general abusive language classifiers tend to be fairly reliable in detecting out-of-domain explicitly abusive utterances but fail to detect new types of more subtle, implicit abuse. Next, we propose an interpretability technique, based on the Testing Concept Activation Vector (TCAV) method from computer vision, to quantify the sensitivity of a trained model to the human-defined concepts of explicit and implicit abusive language, and use that to explain the generalizability of the model on…
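
The TCAV idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a concept direction computed as a difference of mean activations (the original TCAV method instead trains a linear classifier and takes the normal to its decision boundary), and it uses a toy quadratic classification head so that gradients are available in closed form. The names `concept_activation_vector`, `tcav_score`, and all data here are hypothetical.

```python
import numpy as np


def concept_activation_vector(concept_acts, random_acts):
    """Unit vector pointing from random-example activations toward
    concept-example activations at one layer.

    Simplification (assumption): difference of class means rather than
    the normal vector of a trained linear classifier.
    """
    direction = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def tcav_score(logit_grads, cav):
    """Fraction of inputs whose class logit increases when the layer
    activation moves in the concept direction.

    logit_grads: array of shape (n_inputs, dim) holding the gradient of
    the class logit with respect to the layer activation of each input.
    """
    directional_derivs = logit_grads @ cav
    return float(np.mean(directional_derivs > 0))


# Synthetic stand-ins for layer activations (hypothetical data).
rng = np.random.default_rng(0)
dim = 8
shift = np.zeros(dim)
shift[0] = 3.0  # concept examples are displaced along the first axis
concept_acts = rng.normal(size=(50, dim)) + shift
random_acts = rng.normal(size=(50, dim))
cav = concept_activation_vector(concept_acts, random_acts)

# Toy head (assumption): logit(h) = 0.5 * sum(w * h**2), so d logit / d h = w * h.
w = np.ones(dim)
eval_acts = rng.normal(size=(200, dim))
logit_grads = w * eval_acts

score = tcav_score(logit_grads, cav)
print(f"TCAV score for the toy concept: {score:.2f}")
```

With a real classifier, the activations and gradients would come from a chosen hidden layer via automatic differentiation, and the concept examples would be human-selected explicit or implicit abusive utterances.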

Code & Models

Repositories

isarnejad/tcav-for-text-classifiers (PyTorch, official)

Taxonomy

Topics: Hate Speech and Cyberbullying Detection