IndRegBias: A Dataset for Studying Indian Regional Biases in English and Code-Mixed Social Media Comments

Debasmita Panda; Akash Anil; Neelesh Kumar Shukla

arXiv:2601.06477·cs.CL·January 14, 2026

TL;DR

This paper introduces IndRegBias, a dataset of 25,000 Reddit and YouTube comments annotated for Indian regional bias, and evaluates how well language models detect that bias and assess its severity.

Contribution

The paper presents a novel dataset for Indian regional bias in social media comments and proposes a multilevel annotation strategy along with an evaluation of language models for bias detection.
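To make the two-part task (detecting bias, then assessing severity) concrete, here is a minimal sketch of how one annotated comment and a simple accuracy score might be represented. The class name `AnnotatedComment`, its field names, and the `accuracy` helper are all hypothetical and chosen for illustration; the paper's actual annotation schema, label set, and evaluation metrics are not given on this page.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AnnotatedComment:
    """One comment with multilevel labels (hypothetical schema, not the paper's)."""
    text: str
    source: str                      # e.g. "reddit" or "youtube"
    is_regional_bias: bool           # level 1: does the comment express regional bias?
    severity: Optional[str] = None   # level 2: assumed label, set only for biased comments


def accuracy(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty label set")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Placeholder records; the text is deliberately neutral.
comments = [
    AnnotatedComment("placeholder comment A", "reddit", True, "mild"),
    AnnotatedComment("placeholder comment B", "youtube", False),
    AnnotatedComment("placeholder comment C", "reddit", True, "severe"),
    AnnotatedComment("placeholder comment D", "youtube", False),
]
gold_labels = [c.is_regional_bias for c in comments]
model_predictions = [True, True, True, False]  # stand-in for any model's output
print(accuracy(gold_labels, model_predictions))
```

The same `accuracy` helper could score a zero-shot, few-shot, or fine-tuned model on an equal footing, since it only compares label sequences.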

Findings

01. Fine-tuning improves model accuracy in bias detection.

02. Zero-shot and few-shot methods show lower accuracy than fine-tuned models.

03. The dataset enables better understanding of regional biases in Indian social media comments.

Abstract

Warning: This paper contains examples of regional biases in Indian regions that might be offensive towards a particular region.

While social biases corresponding to gender, race, socio-economic conditions, etc., have been extensively studied in the major applications of Natural Language Processing (NLP), biases corresponding to regions have garnered less attention. This is mainly because of (i) difficulty in the extraction of regional bias datasets, (ii) disagreements in annotation due to inherent human biases, and (iii) regional biases being studied in combination with other types of social biases and often being under-represented. This paper focuses on creating a dataset, IndRegBias, consisting of regional biases in an Indian context reflected in users' comments on popular social media platforms, namely Reddit and YouTube. We carefully selected 25,000 comments appearing…

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Hate Speech and Cyberbullying Detection · Complex Network Analysis Techniques