How do datasets, developers, and models affect biases in a low-resourced language?: The Case of the Bengali Language
Dipto Das, Shion Guha, Bryan Semaan

TL;DR
This study empirically examines how datasets, developers, and models influence identity-based biases in Bengali language sentiment analysis, revealing persistent biases despite multilingual and low-resource adaptations.
Contribution
The paper provides an empirical analysis of identity-based biases in Bengali sentiment analysis models and shows how dataset and model choices affect those biases in a low-resource language context.
Findings
BSA models exhibit biases across gender, religion, and nationality.
Combining pre-trained models and datasets introduces inconsistencies.
Bias persists even when multilingual or Bengali-specific models and datasets are used.
Abstract
Sociotechnical systems, such as language technologies, frequently exhibit identity-based biases. These biases exacerbate the experiences of historically marginalized communities and remain understudied in low-resource contexts. While models and datasets specific to a language or with multilingual support are commonly recommended to address these biases, this paper empirically tests the effectiveness of such approaches in the context of gender, religion, and nationality-based identities in Bengali, a widely spoken but low-resourced language. We conducted an algorithmic audit of sentiment analysis models built on mBERT and BanglaBERT, which were fine-tuned using all Bengali sentiment analysis (BSA) datasets from Google Dataset Search. Our analyses showed that BSA models exhibit biases across different identity categories despite having similar semantic content and structure. We also…
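The kind of audit described in the abstract can be illustrated with a minimal sketch: score sentences that are identical except for an identity term, then compare the average scores per term. Everything below is an assumption made for illustration and is not the authors' code. The templates, the placeholder terms GROUP_A and GROUP_B, and the `toy_score` function are hypothetical. In a real audit, `toy_score` would be replaced by a sentiment model fine-tuned on a BSA dataset, and the templates would be Bengali sentences.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical templates: each sentence is identical except for the identity slot.
TEMPLATES: List[str] = [
    "My neighbor is {}.",
    "I had lunch with someone who is {}.",
]

# Placeholder identity terms (assumption); a real audit would use Bengali
# terms expressing gender, religion, or nationality.
IDENTITY_TERMS: List[str] = ["GROUP_A", "GROUP_B"]


def toy_score(sentence: str) -> float:
    """Stand-in for a fine-tuned sentiment model (not the paper's model).

    Returns a positivity score in [0, 1]. It is deliberately skewed
    against GROUP_B so that the audit has a gap to detect.
    """
    return 0.3 if "GROUP_B" in sentence else 0.5


def audit(
    score: Callable[[str], float],
    templates: List[str],
    terms: List[str],
) -> Dict[str, float]:
    """Return the mean score per identity term across all templates."""
    return {
        term: mean(score(template.format(term)) for template in templates)
        for term in terms
    }


def max_gap(means: Dict[str, float]) -> float:
    """Largest difference in mean score between any two identity terms."""
    return max(means.values()) - min(means.values())


means = audit(toy_score, TEMPLATES, IDENTITY_TERMS)
gap = max_gap(means)
print(means, gap)
```

Because the sentences differ only in the identity term, a nonzero gap points to the scorer rather than to the sentence content. A full audit would also test whether the gap is statistically significant, which this sketch omits.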
