My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks
Tanmay Chavan, Omkar Gokhale, Aditya Kane, Shantanu Patankar, Raviraj Joshi

TL;DR
This paper introduces a large code-mixed Marathi-English corpus, pre-trained transformer models, and benchmark datasets for downstream tasks, supporting research on low-resource code-mixed Indian languages.
Contribution
It provides the first dedicated code-mixed Marathi-English corpus, models, and evaluation benchmarks, filling a critical gap in low-resource language processing.
Findings
Models trained on the corpus outperform existing BERT models.
The datasets support code-mixed hate speech detection, sentiment analysis, and language identification.
All resources are publicly available for research use.
Abstract
The research on code-mixed data is limited due to the unavailability of dedicated code-mixed datasets and pre-trained language models. In this work, we focus on the low-resource Indian language Marathi which lacks any prior work in code-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences for pretraining. We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus. Furthermore, for benchmarking, we present three supervised datasets MeHate, MeSent, and MeLID for downstream tasks like code-mixed Mr-En hate speech detection, sentiment analysis, and language identification respectively. These evaluation datasets individually consist of manually annotated ~12,000 Marathi-English code-mixed tweets. Ablations show that the models trained on this novel corpus…
Code & Models
- 🤗 l3cube-pune/me-bert · model · 2 downloads
- 🤗 l3cube-pune/me-bert-mixed · model
- 🤗 l3cube-pune/me-roberta · model · 4 downloads
- 🤗 l3cube-pune/me-roberta-mixed · model · 26 downloads
- 🤗 l3cube-pune/me-sent-roberta · model · 1 download
- 🤗 l3cube-pune/me-hate-roberta · model · 4 downloads
- 🤗 l3cube-pune/me-lid-roberta · model · 3 downloads
- 🤗 l3cube-pune/me-bert-mixed-v2 · model · 1 download
- 🤗 l3cube-pune/me-lid-bert · model · 88 downloads
- 🤗 l3cube-pune/me-hate-bert · model · 3 downloads
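The repositories listed above can be grouped and loaded roughly as follows. This is a minimal sketch, not the authors' code: the repository IDs are taken from the list, but `guess_task` and `load_encoder` are hypothetical helpers, the task labels are inferred from the repository names alone, and the loader assumes each repository is stored in a format that the Hugging Face `transformers` library can read.

```python
# Repository IDs exactly as listed on this page.
MODEL_IDS = [
    "l3cube-pune/me-bert",
    "l3cube-pune/me-bert-mixed",
    "l3cube-pune/me-roberta",
    "l3cube-pune/me-roberta-mixed",
    "l3cube-pune/me-sent-roberta",
    "l3cube-pune/me-hate-roberta",
    "l3cube-pune/me-lid-roberta",
    "l3cube-pune/me-bert-mixed-v2",
    "l3cube-pune/me-lid-bert",
    "l3cube-pune/me-hate-bert",
]

# Assumption: the task is guessed from the repository name only
# ("sent" -> MeSent, "hate" -> MeHate, "lid" -> MeLID).
TASK_MARKERS = {"sent": "sentiment", "hate": "hate-speech", "lid": "language-id"}


def guess_task(model_id: str) -> str:
    """Return a task label inferred from the name, or 'pretrained' if none matches."""
    name_parts = model_id.split("/")[-1].split("-")
    for marker, task in TASK_MARKERS.items():
        if marker in name_parts:
            return task
    return "pretrained"


def load_encoder(model_id: str):
    """Load tokenizer and encoder weights (hypothetical helper).

    Requires the `transformers` package and network access; the import is
    kept inside the function so the grouping logic above runs without it.
    """
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


if __name__ == "__main__":
    for model_id in MODEL_IDS:
        print(f"{model_id}: {guess_task(model_id)}")
```

Calling `load_encoder("l3cube-pune/me-bert")` would return the tokenizer and the bare encoder without any task head. For the fine-tuned checkpoints, which head class to load depends on how each model was trained, which this page does not state.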
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques · Interpreting and Communication in Healthcare
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Attention Dropout · WordPiece · Dense Connections · Adam · Residual Connection
