DefVerify: Do Hate Speech Models Reflect Their Dataset's Definition?

Urja Khurana; Eric Nalisnick; Antske Fokkens

arXiv:2410.15911·cs.CL·January 14, 2025

TL;DR

DefVerify is a method for checking whether a hate speech detection model reflects the definition of hate speech its developers intended. It examines both the dataset and the model's behavior to find gaps and to locate where in the workflow they arise.

Contribution

The paper introduces a three-step procedure that encodes a user-specified definition of hate speech, quantifies how well a model reflects that definition, and diagnoses where in the workflow any mismatch originates.
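The encode, quantify, and diagnose steps can be illustrated with a minimal Python sketch. This is not the authors' implementation from urjakh/defverify. It assumes, for illustration only, that a definition is encoded as labeled test cases grouped by aspect, that alignment is measured as per-aspect accuracy, and that a failure is attributed to the data when the aspect never appears in the training set. All names (`TestCase`, `quantify`, `diagnose`, `train_coverage`, the 0.8 threshold) and the placeholder texts are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class TestCase:
    """One probe sentence with the label the definition requires (hypothetical)."""
    text: str
    expected_hateful: bool


# Step 1 (encode): the definition as test cases per aspect.
# The aspects and the "[GROUP]" placeholder texts are illustrative assumptions.
definition: Dict[str, List[TestCase]] = {
    "explicit_attack": [
        TestCase("I hate [GROUP]", True),
        TestCase("[GROUP] are wonderful", False),
    ],
    "implicit_derogation": [
        TestCase("[GROUP] should know their place", True),
        TestCase("I know my place in the queue", False),
    ],
}


def quantify(model: Callable[[str], bool],
             definition: Dict[str, List[TestCase]]) -> Dict[str, float]:
    """Step 2 (quantify): per-aspect accuracy of the model on the test cases."""
    scores = {}
    for aspect, cases in definition.items():
        if not cases:
            raise ValueError(f"aspect {aspect!r} has no test cases")
        correct = sum(model(c.text) == c.expected_hateful for c in cases)
        scores[aspect] = correct / len(cases)
    return scores


def diagnose(scores: Dict[str, float],
             train_coverage: Dict[str, int],
             threshold: float = 0.8) -> Dict[str, str]:
    """Step 3 (diagnose): a simple assumed rule for locating the failure."""
    report = {}
    for aspect, score in scores.items():
        if score >= threshold:
            report[aspect] = "aligned"
        elif train_coverage.get(aspect, 0) == 0:
            # The aspect never occurs in the training data.
            report[aspect] = "data gap"
        else:
            # The aspect is in the data, yet the model did not learn it.
            report[aspect] = "model gap"
    return report


def toy_model(text: str) -> bool:
    """Stand-in classifier that flags any text containing 'hate'."""
    return "hate" in text.lower()


# Hypothetical counts of training examples covering each aspect.
train_coverage = {"explicit_attack": 120, "implicit_derogation": 0}

scores = quantify(toy_model, definition)
report = diagnose(scores, train_coverage)
print(scores)
print(report)
```

Keeping the scores per aspect rather than as one aggregate number is what makes the diagnosis step possible: a model can score well overall while missing an entire part of the definition.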

Findings

01. Identified gaps between dataset annotations and user-defined hate speech concepts.

02. Demonstrated DefVerify on six benchmark datasets, revealing mismatches between models and their definitions.

03. Provided insights for improving hate speech detection models.

Abstract

When building a predictive model, it is often difficult to ensure that application-specific requirements are encoded by the model that will eventually be deployed. Consider researchers working on hate speech detection. They will have an idea of what is considered hate speech, but building a model that reflects their view accurately requires preserving those ideals throughout the workflow of data set construction and model training. Complications such as sampling bias, annotation bias, and model misspecification almost always arise, possibly resulting in a gap between the application specification and the model's actual behavior upon deployment. To address this issue for hate speech detection, we propose DefVerify: a 3-step procedure that (i) encodes a user-specified definition of hate speech, (ii) quantifies to what extent the model reflects the intended definition, and (iii) tries to…

Code & Models

Repositories

urjakh/defverify (PyTorch, official)

Taxonomy

Topics: Hate Speech and Cyberbullying Detection

Methods: Sparse Evolutionary Training