On Importance of Code-Mixed Embeddings for Hate Speech Identification

Shruti Jagdale; Omkar Khade; Gauri Takalikar; Mihir Inamdar; Raviraj Joshi

arXiv:2411.18577·cs.CL·November 28, 2024

Open Access

TL;DR

This paper investigates the importance of code-mixed embeddings for hate speech detection, showing that HingBERT and Hing-FastText models trained on Hindi-English data outperform standard models on code-mixed hate speech datasets.

Contribution

It demonstrates that code-mixed embeddings, and models pretrained on Hindi-English code-mixed data, improve hate speech detection accuracy over their standard counterparts.

Findings

01. HingBERT outperforms BERT on hate speech detection in code-mixed data.
02. Hing-FastText surpasses standard FastText and vanilla BERT models.
03. Training on extensive Hindi-English data enhances model performance.

Abstract

Code-mixing is the practice of using two or more languages in a single sentence, which often occurs in multilingual communities such as India where people commonly speak multiple languages. Classic NLP tools, trained on monolingual data, face challenges when dealing with code-mixed data. Extracting meaningful information from sentences containing multiple languages becomes difficult, particularly in tasks like hate speech detection, due to linguistic variation, cultural nuances, and data sparsity. To address this, we aim to analyze the significance of code-mixed embeddings and evaluate the performance of BERT and HingBERT models (trained on a Hindi-English corpus) in hate speech detection. Our study demonstrates that HingBERT models, benefiting from training on the extensive Hindi-English dataset L3Cube-HingCorpus, outperform BERT models when tested on hate speech text datasets. We also…
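The "linguistic variation" mentioned in the abstract includes inconsistent romanized spellings of the same Hindi word. The sketch below is a general illustration of why subword-based embeddings in the FastText family can cope with such variants; it is not the authors' code and does not reproduce Hing-FastText. The function names `char_ngrams` and `ngram_overlap`, the example words, and the use of Jaccard overlap are all assumptions chosen for illustration; the n-gram range of 3 to 6 and the `<`/`>` boundary markers follow the original FastText subword formulation.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`, FastText-style.

    The word is wrapped in '<' and '>' boundary markers before slicing,
    so prefixes and suffixes get their own distinct n-grams.
    """
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words (hypothetical helper)."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)


if __name__ == "__main__":
    # "nahi" and "nahin" are two common romanizations of the same Hindi word;
    # "hate" is an unrelated English word used only as a contrast.
    print(ngram_overlap("nahi", "nahin"))
    print(ngram_overlap("nahi", "hate"))
```

Because a subword model builds a word's vector from the vectors of its n-grams, two spellings that share many n-grams tend to end up with similar representations, even if one of them was rare or unseen during training. A whole-word vocabulary would treat the two spellings as unrelated tokens.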

Taxonomy

Topics: Hate Speech and Cyberbullying Detection

Methods: Attention Is All You Need · Linear Layer · Linear Warmup With Linear Decay · Layer Normalization · Adam · Residual Connection · Weight Decay · Softmax · Multi-Head Attention