Language Identification of Hindi-English tweets using code-mixed BERT

Mohd Zeeshan Ansari; M M Sufyan Beg; Tanvir Ahmad; Mohd Jazib Khan; Ghazali Wasim

arXiv:2107.01202 · cs.CL · July 5, 2021

TL;DR

This paper explores the use of code-mixed BERT models for language identification in Hindi-English tweets, demonstrating improved performance through transfer learning and fine-tuning on social media data.

Contribution

The paper pre-trains BERT on code-mixed data for language identification and reports better results than monolingual models.

Findings

01 Models pre-trained on code-mixed data outperform monolingual models.

02 Fine-tuning BERT improves language classification accuracy.

03 Utilizes a new dataset of Hindi-English-Urdu code-mixed tweets.

Abstract

Language identification of social media text has been an interesting problem of study in recent years. Social media messages are predominantly code-mixed in non-English-speaking states. Prior knowledge from pre-trained contextual embeddings has shown state-of-the-art results for a range of downstream tasks. Recently, models such as BERT have shown that, using a large amount of unlabeled data, pretrained language models are even more beneficial for learning common language representations. Extensive experiments exploiting transfer learning and fine-tuning BERT models to identify language on Twitter are presented in this paper. The work utilizes a data collection of Hindi-English-Urdu code-mixed text for language pre-training and Hindi-English code-mixed text for subsequent word-level language classification. The results show that the representations pre-trained over code-mixed data produce…
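Word-level language classification with a subword model such as BERT needs some way to map one label per word onto several subword pieces. The sketch below shows one common convention: the first piece of each word carries the word's label and the remaining pieces receive an ignore index so they do not contribute to the loss. This page does not say which scheme the authors used, so treat this as an assumption. The tag set, the toy tokenizer, the function names, and the example tweet are all hypothetical and stand in for a real WordPiece tokenizer and the paper's data.

```python
from typing import Callable, List, Tuple

# -100 is the default ignore_index of PyTorch's cross-entropy loss.
IGNORE_INDEX = -100

# Hypothetical tag set; the paper's actual label inventory is not given here.
LABEL_TO_ID = {"hi": 0, "en": 1, "other": 2}


def toy_subword_tokenize(word: str, max_piece_len: int = 4) -> List[str]:
    """Stand-in for a WordPiece tokenizer: split a word into fixed-size chunks."""
    if not word:
        raise ValueError("words must be non-empty")
    chunks = [word[i:i + max_piece_len] for i in range(0, len(word), max_piece_len)]
    # Mark continuation pieces with '##', as WordPiece vocabularies do.
    return [chunks[0]] + ["##" + chunk for chunk in chunks[1:]]


def align_labels(
    words: List[str],
    labels: List[str],
    tokenize: Callable[[str], List[str]] = toy_subword_tokenize,
) -> Tuple[List[str], List[int]]:
    """Expand word-level labels to subword pieces (first piece keeps the label)."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    pieces: List[str] = []
    label_ids: List[int] = []
    for word, label in zip(words, labels):
        word_pieces = tokenize(word)
        pieces.extend(word_pieces)
        # First piece gets the real label, continuation pieces are masked out.
        label_ids.append(LABEL_TO_ID[label])
        label_ids.extend([IGNORE_INDEX] * (len(word_pieces) - 1))
    return pieces, label_ids


# Hypothetical romanized Hindi-English example with word-level tags.
example_words = ["yaar", "weekend", "pe", "movie", "chalein"]
example_labels = ["hi", "en", "hi", "en", "hi"]
example_pieces, example_ids = align_labels(example_words, example_labels)
print(list(zip(example_pieces, example_ids)))
```

At prediction time, the same alignment can be read in reverse: the label predicted for the first piece of each word is taken as the word's language tag.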

Taxonomy

Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Layer Normalization · Weight Decay · Dropout · Linear Warmup With Linear Decay · Residual Connection