Developing a Multilingual Annotated Corpus of Misogyny and Aggression
Shiladitya Bhattacharya, Siddharth Singh, Ritesh Kumar, Akanksha Bansal, Akash Bhagat, Yogesh Dawer, Bornini Lahiri, Atul Kr. Ojha

TL;DR
This paper presents the creation of a multilingual annotated corpus of social media comments in Indian languages, focusing on misogyny and aggression, and reports baseline classification results.
Contribution
It introduces a new multilingual dataset with detailed annotations for misogyny and aggression in Indian social media comments, along with baseline classification experiments.
Findings
Over 20,000 comments annotated across three languages
Baseline classification experiments for detecting misogyny were run in each of the three languages
Highlights challenges in annotating multilingual social media data
Abstract
In this paper, we discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla as part of a project on studying and automatically identifying misogyny and communalism on social media (the ComMA Project). The dataset is collected from comments on YouTube videos and currently contains a total of over 20,000 comments. The comments are annotated at two levels - aggression (overtly aggressive, covertly aggressive, and non-aggressive) and misogyny (gendered and non-gendered). We describe the process of data collection, the tagset used for annotation, and issues and challenges faced during the process of annotation. Finally, we discuss the results of the baseline experiments conducted to develop a classifier for misogyny in the three languages.
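The two-level tagset described in the abstract can be pictured as a small record type. The following is a minimal sketch, not the project's actual data format: the class names, field names, language codes, and placeholder comments are all assumptions made for illustration, and only the label values themselves come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Aggression(Enum):
    # Level 1 labels, as named in the abstract.
    OVERT = "overtly aggressive"
    COVERT = "covertly aggressive"
    NONE = "non-aggressive"


class Misogyny(Enum):
    # Level 2 labels, as named in the abstract.
    GENDERED = "gendered"
    NON_GENDERED = "non-gendered"


@dataclass(frozen=True)
class AnnotatedComment:
    # Hypothetical record layout; the real corpus schema may differ.
    text: str
    language: str  # illustrative codes: "hi", "bn", "en"
    aggression: Aggression
    misogyny: Misogyny


def label_distribution(comments):
    """Count how often each (language, aggression, misogyny) triple occurs."""
    return Counter((c.language, c.aggression, c.misogyny) for c in comments)


# Placeholder rows only; these are not real corpus comments.
sample = [
    AnnotatedComment("placeholder comment 1", "hi", Aggression.OVERT, Misogyny.GENDERED),
    AnnotatedComment("placeholder comment 2", "hi", Aggression.NONE, Misogyny.NON_GENDERED),
    AnnotatedComment("placeholder comment 3", "bn", Aggression.COVERT, Misogyny.GENDERED),
]
counts = label_distribution(sample)
```

Keeping the two levels as separate fields means each comment carries one aggression label and one misogyny label independently, so a comment can be, for example, covertly aggressive and gendered at the same time.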
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Gender, Feminism, and Media · Bullying, Victimization, and Aggression
