A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research
Mohammadreza Rezvan, Saeedeh Shekarpour, Lakshika Balasuriya, Krishnaprasad Thirunarayan, Valerie Shalin, Amit Sheth

TL;DR
This paper introduces a high-quality, annotated Twitter corpus and lexicon for five types of harassment, providing a valuable resource for cyberbullying research and standard benchmarks.
Contribution
It presents a quality annotated corpus and an offensive-words lexicon covering five harassment types, intended to serve as a benchmark resource for detecting and analyzing cyberbullying behaviors.
Findings
25,000 annotated tweets across five harassment types
A new lexicon of offensive words for harassment detection
Resource shared publicly for research community
Abstract
A quality annotated corpus is essential, especially for applied research. Despite the Web science community's recent focus on cyberbullying research, the community still does not have standard benchmarks. In this paper, we publish first, a quality annotated corpus and second, an offensive-words lexicon capturing different types of harassment: (i) sexual harassment, (ii) racial harassment, (iii) appearance-related harassment, (iv) intellectual harassment, and (v) political harassment. We crawled data from Twitter using our offensive lexicon. We then relied on human judges to annotate the collected tweets w.r.t. the contextual types, because the use of offensive words alone is not sufficient to reliably detect harassment. Our corpus consists of 25,000 annotated tweets in five contextual types. We are pleased to share this novel annotated corpus and the lexicon with the research…
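The collection step described in the abstract (use the lexicon to pull candidate tweets, then hand them to human annotators) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the lexicon entries are made-up placeholder tokens, the function names are hypothetical, and the tokenization is deliberately simple. It is not the authors' crawler. A lexicon hit only marks a tweet as a candidate for a type; the final label still comes from human judgment.

```python
import re
from collections import defaultdict

# Hypothetical placeholder lexicon. The real lexicon entries are NOT
# reproduced here; these tokens are made up for illustration only.
LEXICON = {
    "sexual": {"wordaa"},
    "racial": {"wordbb"},
    "appearance": {"wordcc"},
    "intellectual": {"worddd", "wordee"},
    "political": {"wordff"},
}

# Simple tokenizer (an assumption): lowercase alphanumeric runs and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def candidate_types(text, lexicon=LEXICON):
    """Return the sorted harassment types whose lexicon terms occur in the text."""
    tokens = set(TOKEN_RE.findall(text.lower()))
    return sorted(t for t, words in lexicon.items() if tokens & words)


def select_candidates(tweets, lexicon=LEXICON):
    """Group tweets by candidate type; tweets with no lexicon hit are dropped."""
    buckets = defaultdict(list)
    for tweet in tweets:
        for harassment_type in candidate_types(tweet, lexicon):
            buckets[harassment_type].append(tweet)
    return dict(buckets)


if __name__ == "__main__":
    sample = ["that was WORDDD, really", "nothing to see here"]
    # Each bucket would be passed on to human annotators for the final label.
    print(select_candidates(sample))
```

A tweet can land in more than one bucket when it matches terms from several types, which is one reason a separate annotation pass is needed.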
