Current Limitations in Cyberbullying Detection: on Evaluation Criteria, Reproducibility, and Data Scarcity

Chris Emmery; Ben Verhoeven; Guy De Pauw; Gilles Jacobs; Cynthia Van Hee; Els Lefever; Bart Desmet; Véronique Hoste; Walter Daelemans

arXiv:1910.11922 · cs.CL · August 16, 2021

TL;DR

This paper critically examines current challenges in cyberbullying detection, highlighting data scarcity and evaluation issues, and proposes a crowdsourcing approach to generate more effective training data.

Contribution

The paper evaluates existing datasets, demonstrates that classifiers trained on them generalize poorly across domains, and introduces a crowdsourcing method to create realistic, useful data for training classifiers.

Findings

01. Existing datasets are small and heterogeneous, limiting model applicability.

02. Classifiers trained on current datasets lack cross-domain generalization.

03. Crowdsourcing can generate plausible, effective data to improve classifier performance.

Abstract

The detection of online cyberbullying has seen an increase in societal importance, popularity in research, and available open data. Nevertheless, while computational power and affordability of resources continue to increase, the access restrictions on high-quality data limit the applicability of state-of-the-art techniques. Consequently, much of the recent research uses small, heterogeneous datasets, without a thorough evaluation of applicability. In this paper, we further illustrate these issues, as we (i) evaluate many publicly available resources for this task and demonstrate difficulties with data collection. These predominantly yield small datasets that fail to capture the required complex social dynamics and impede direct comparison of progress. We (ii) conduct an extensive set of experiments that indicate a general lack of cross-domain generalization of classifiers trained on…

Code & Models

Repositories

cmry/amica
tfOfficial
