DefVerify: Do Hate Speech Models Reflect Their Dataset's Definition?
Urja Khurana, Eric Nalisnick, Antske Fokkens

TL;DR
DefVerify is a method for checking whether a hate speech detection model reflects the definition of hate speech it was meant to encode. It analyzes both the dataset and the model's behavior to locate gaps and failure points in the workflow.
Contribution
The paper introduces a three-step procedure that encodes a user-specified definition of hate speech, quantifies how well a model reflects that definition, and diagnoses where the two diverge.
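The three steps can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say how DefVerify encodes a definition, so the sketch assumes a definition can be written as labeled test cases grouped by criterion. The names `DefinitionCase`, `agreement_by_criterion`, `failing_criteria`, and `toy_predict`, the criterion labels, the placeholder tokens `<SLUR>` and `<GROUP>`, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DefinitionCase:
    """One probe for a single aspect of a hate speech definition (assumed encoding)."""
    criterion: str          # which part of the definition this case exercises
    text: str               # input shown to the model
    expected_hateful: bool  # label the definition implies


def agreement_by_criterion(
    cases: List[DefinitionCase], predict: Callable[[str], bool]
) -> Dict[str, float]:
    """Step 2 (quantify): fraction of cases per criterion where the model agrees."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for case in cases:
        totals[case.criterion] = totals.get(case.criterion, 0) + 1
        if predict(case.text) == case.expected_hateful:
            hits[case.criterion] = hits.get(case.criterion, 0) + 1
    return {c: hits.get(c, 0) / n for c, n in totals.items()}


def failing_criteria(scores: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Step 3 (diagnose): criteria whose agreement falls below the threshold."""
    return sorted(c for c, s in scores.items() if s < threshold)


def toy_predict(text: str) -> bool:
    """Stand-in model (an assumption): flags any text containing the slur placeholder."""
    return "<SLUR>" in text


# Step 1 (encode): a hypothetical definition written as labeled probes.
cases = [
    DefinitionCase("explicit_slur", "<SLUR> should leave.", True),
    DefinitionCase("targets_protected_group", "People from <GROUP> are worthless.", True),
    DefinitionCase("non_hateful_profanity", "This damn printer is broken again.", False),
    DefinitionCase("counter_speech", "Calling anyone <SLUR> is unacceptable.", False),
]

scores = agreement_by_criterion(cases, toy_predict)
print(scores)
print(failing_criteria(scores))
```

With the keyword-based stand-in, the model agrees with the definition on explicit slurs and harmless profanity but fails on implicit attacks against a group and on counter-speech that quotes a slur. A real check would need many cases per criterion and a trained classifier in place of `toy_predict`.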
Findings
Identified gaps between dataset annotations and user-defined hate speech concepts.
Demonstrated DefVerify on six benchmark datasets, revealing mismatches between models and their intended definitions.
Located failure points in the workflow to guide improvements to hate speech detection models.
Abstract
When building a predictive model, it is often difficult to ensure that application-specific requirements are encoded by the model that will eventually be deployed. Consider researchers working on hate speech detection. They will have an idea of what is considered hate speech, but building a model that reflects their view accurately requires preserving those ideals throughout the workflow of data set construction and model training. Complications such as sampling bias, annotation bias, and model misspecification almost always arise, possibly resulting in a gap between the application specification and the model's actual behavior upon deployment. To address this issue for hate speech detection, we propose DefVerify: a 3-step procedure that (i) encodes a user-specified definition of hate speech, (ii) quantifies to what extent the model reflects the intended definition, and (iii) tries to…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
Methods: Sparse Evolutionary Training
