L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models
Ravindra Nayak, Raviraj Joshi

TL;DR
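This paper introduces L3Cube-HingCorpus, a large-scale Hindi-English code-mixed dataset in Roman script scraped from Twitter (52.93M sentences, 1.04B tokens), and pre-trains several transformer models on it, including HingBERT and HingGPT. The BERT models are evaluated on downstream code-mixed tasks such as sentiment analysis, and the datasets and models are made publicly available for future work on code-mixed language processing.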
Contribution
The paper presents the first large-scale real Hindi-English code-mixed dataset and pre-trained transformer models specifically designed for code-mixed NLP tasks.
Findings
HingBERT models outperform existing models on code-mixed NLP tasks.
HingGPT can generate coherent full tweets in code-mixed language.
The datasets and models are publicly available for further research.
Abstract
Code-switching occurs when more than one language is mixed in a given sentence or conversation. This phenomenon is more prominent on social media platforms, and its adoption is increasing over time. Therefore, code-mixed NLP has been extensively studied in the literature. As pre-trained transformer-based architectures gain popularity, we observe that real code-mixed data are too scarce to pre-train large language models. We present L3Cube-HingCorpus, the first large-scale real Hindi-English code-mixed dataset in Roman script. It consists of 52.93M sentences and 1.04B tokens, scraped from Twitter. We further present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The BERT models have been pre-trained on the code-mixed HingCorpus using masked language modelling objectives. We show the effectiveness of these BERT models on downstream tasks such as code-mixed sentiment analysis,…
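The masked language modelling objective mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the masking hyperparameters, so the code below assumes the standard recipe from the original BERT paper (select about 15% of tokens; of those, replace 80% with a mask token, 10% with a random token, and leave 10% unchanged). The function name `mask_tokens`, the whitespace tokenization, and the sample sentence are hypothetical and are not taken from the paper.

```python
import random

MASK_TOKEN = "[MASK]"  # assumption: BERT-style mask symbol


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt a token list for masked language modelling.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels


# Hypothetical Roman-script Hindi-English sentence, split on whitespace.
sentence = "aaj ka match bahut accha tha yaar".split()
vocab = sorted(set(sentence))
corrupted, labels = mask_tokens(sentence, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

During pre-training, the model sees `corrupted` as input and is trained to recover the tokens stored in `labels`; positions where the label is `None` do not contribute to the loss. A real pipeline would operate on subword IDs rather than whitespace-split words.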
Code & Models
- 🤗 l3cube-pune/hing-bert · model · 169 dl · ♡ 2
- 🤗 l3cube-pune/hing-mbert · model · 58 dl · ♡ 2
- 🤗 l3cube-pune/hing-roberta · model · 773 dl · ♡ 1
- 🤗 l3cube-pune/hing-bert-lid · model · 104 dl · ♡ 1
- 🤗 l3cube-pune/hing-gpt · model · 27 dl
- 🤗 l3cube-pune/hing-gpt-devanagari · model · 13 dl · ♡ 1
- 🤗 l3cube-pune/hing-mbert-mixed · model · 73 dl · ♡ 1
- 🤗 l3cube-pune/hing-roberta-mixed · model · 111 dl · ♡ 1
- 🤗 l3cube-pune/hing-mbert-mixed-v2 · model · 17 dl
- 🤗 l3cube-pune/hing-fast-text-embedding · model
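As a rough usage sketch, one of the masked-language-model checkpoints listed above could be queried through the Hugging Face `transformers` fill-mask pipeline. This assumes `transformers` and a backend such as PyTorch are installed and that the checkpoint is reachable on the Hub; the helper names `make_masked_prompt` and `top_predictions` and the sample sentence are hypothetical. The import is deferred so the prompt helper can be used without the library.

```python
MODEL_ID = "l3cube-pune/hing-bert"  # any BERT-style checkpoint from the list


def make_masked_prompt(sentence, target_word, mask_token):
    """Replace the first occurrence of target_word with the mask token."""
    words = sentence.split()
    if target_word not in words:
        raise ValueError(f"{target_word!r} not found in sentence")
    words[words.index(target_word)] = mask_token
    return " ".join(words)


def top_predictions(sentence, target_word, model_id=MODEL_ID, k=5):
    """Return the k most likely fillers for the masked word as (token, score)."""
    # Deferred import: only needed when the model is actually queried.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    # Read the mask token from the tokenizer instead of hard-coding it,
    # since BERT-style and RoBERTa-style tokenizers use different symbols.
    prompt = make_masked_prompt(sentence, target_word, fill.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill(prompt, top_k=k)]
```

Calling `top_predictions("aaj ka match bahut accha tha", "match")` downloads the checkpoint on first use and returns candidate fillers with their scores. The fill-mask task applies only to the masked-language-model checkpoints; the GPT-style entries would instead go through a text-generation pipeline.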
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Hate Speech and Cyberbullying Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Attention Dropout · Layer Normalization · Weight Decay
