IndRegBias: A Dataset for Studying Indian Regional Biases in English and Code-Mixed Social Media Comments
Debasmita Panda, Akash Anil, Neelesh Kumar Shukla

TL;DR
This paper introduces IndRegBias, a new dataset of 25,000 social media comments from India annotated for regional biases, and evaluates how well various language models detect regional bias and assess its severity.
Contribution
The paper presents a novel dataset for Indian regional bias in social media comments and proposes a multilevel annotation strategy along with an evaluation of language models for bias detection.
Findings
Fine-tuning improves model accuracy in bias detection.
Zero-shot and few-shot methods show lower accuracy than fine-tuned models.
The dataset enables better understanding of regional biases in Indian social media comments.
Abstract
Warning: This paper contains examples of regional biases in India that might be offensive towards a particular region. While social biases corresponding to gender, race, socio-economic conditions, etc., have been extensively studied in the major applications of Natural Language Processing (NLP), biases corresponding to regions have garnered less attention. This is mainly because of (i) difficulty in the extraction of regional bias datasets, (ii) disagreements in annotation due to inherent human biases, and (iii) regional biases being studied in combination with other types of social biases and often being under-represented. This paper focuses on creating a dataset, IndRegBias, consisting of regional biases in an Indian context reflected in users' comments on popular social media platforms, namely Reddit and YouTube. We carefully selected 25,000 comments appearing…
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Hate Speech and Cyberbullying Detection · Complex Network Analysis Techniques
