L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models

Ravindra Nayak; Raviraj Joshi

arXiv:2204.08398·cs.CL·April 19, 2022·21 cites

TL;DR

This paper introduces L3Cube-HingCorpus, a large-scale Hindi-English code-mixed dataset, and develops several transformer models including HingBERT and HingGPT, demonstrating their effectiveness on various NLP tasks and providing resources for future code-mixed language processing.

Contribution

The paper presents the first large-scale real Hindi-English code-mixed dataset and pre-trained transformer models specifically designed for code-mixed NLP tasks.

Findings

01. HingBERT models outperform existing models on code-mixed NLP tasks.

02. HingGPT can generate coherent full tweets in code-mixed language.

03. The datasets and models are publicly available for further research.

Abstract

Code-switching occurs when more than one language is mixed in a given sentence or a conversation. This phenomenon is more prominent on social media platforms and its adoption is increasing over time. Therefore, code-mixed NLP has been extensively studied in the literature. As pre-trained transformer-based architectures are gaining popularity, we observe that real code-mixing data are scarce to pre-train large language models. We present L3Cube-HingCorpus, the first large-scale real Hindi-English code-mixed data in Roman script. It consists of 52.93M sentences and 1.04B tokens, scraped from Twitter. We further present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The BERT models have been pre-trained on the code-mixed HingCorpus using masked language modelling objectives. We show the effectiveness of these BERT models on the subsequent downstream tasks like code-mixed sentiment analysis,…
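
The masked language modelling objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes the 80/10/10 corruption scheme from the original BERT recipe; the abstract does not state which masking scheme or rate was used for HingBERT, so the function `mask_tokens`, the `mask_prob` value, the whitespace tokenization, the toy vocabulary, and the example sentence are all hypothetical and chosen only for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token list for masked language modelling.

    Assumption: BERT-style scheme. Each token is selected with
    probability mask_prob; a selected token becomes [MASK] 80% of the
    time, a random vocabulary token 10% of the time, and stays
    unchanged 10% of the time. This is not confirmed as the scheme
    used for HingBERT.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)  # position is ignored by the loss
            corrupted.append(tok)
    return corrupted, labels


# Hypothetical Roman-script Hindi-English sentence, split on whitespace.
sentence = "yaar aaj ka match bahut accha tha but weather was bad".split()
vocab = sorted(set(sentence))  # toy vocabulary built from the sentence itself
corrupted, labels = mask_tokens(sentence, vocab, mask_prob=0.15, seed=0)
print(corrupted)
print(labels)
```

During pre-training, a model sees the corrupted sequence and is scored only on the positions whose label is not `None`. A real pipeline would use a subword tokenizer rather than whitespace splitting.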

Code & Models

Repositories

l3cube-pune/code-mixed-nlp (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Hate Speech and Cyberbullying Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Attention Dropout · Layer Normalization · Weight Decay