My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks

Tanmay Chavan; Omkar Gokhale; Aditya Kane; Shantanu Patankar; Raviraj Joshi

arXiv:2306.14030 · cs.CL · July 21, 2023

TL;DR

This paper introduces a large code-mixed Marathi-English corpus, pre-trained transformer models, and benchmark datasets for downstream tasks, significantly advancing research in low-resource code-mixed Indian languages.

Contribution

It provides the first dedicated code-mixed Marathi-English corpus, models, and evaluation benchmarks, filling a critical gap in low-resource language processing.

Findings

01. Models trained on the corpus outperform existing BERT models.

02. The datasets support hate speech detection, sentiment analysis, and language identification.

03. All resources are publicly available for research use.

Abstract

Research on code-mixed data is limited by the unavailability of dedicated code-mixed datasets and pre-trained language models. In this work, we focus on the low-resource Indian language Marathi, which lacks any prior work in code-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences for pretraining. We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus. Furthermore, for benchmarking, we present three supervised datasets, MeHate, MeSent, and MeLID, for downstream tasks: code-mixed Mr-En hate speech detection, sentiment analysis, and language identification, respectively. Each of these evaluation datasets consists of ~12,000 manually annotated Marathi-English code-mixed tweets. Ablations show that the models trained on this novel corpus…

Code & Models

Repositories

l3cube-pune/MarathiNLP (Official)

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques · Interpreting and Communication in Healthcare

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Attention Dropout · WordPiece · Dense Connections · Adam · Residual Connection