Development of POS tagger for English-Bengali Code-Mixed data
Tathagata Raha, Sainik Kumar Mahata, Dipankar Das, Sivaji Bandyopadhyay

TL;DR
This paper presents a modular POS tagging system for English-Bengali code-mixed social media texts, achieving 75.29% accuracy by combining language identification and language-specific POS taggers.
Contribution
It introduces a novel modular approach for POS tagging of code-mixed data, integrating language identification with separate POS taggers for each language.
Findings
Achieved 75.29% accuracy on code-mixed tweets.
Developed a modular system combining language identification and language-specific POS tagging.
Created a dataset of manually tagged code-mixed sentences for evaluation.
Abstract
Code-mixed texts are widespread nowadays due to the advent of social media. Because these texts combine two languages within a single sentence, they give rise to various research problems in Natural Language Processing. In this paper, we investigate one such problem, namely Part-of-Speech (POS) tagging of code-mixed texts. We have built a system that can POS tag English-Bengali code-mixed data in which the Bengali words are written in Roman script. Our approach begins with the collection and cleaning of English-Bengali code-mixed tweets, which were used as a development dataset for building our system. The proposed system is modular: it first tags individual tokens with their respective languages and then passes them to separate POS taggers designed for each language (English and Bengali, in our case). Tags given by the two systems are later…
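The modular flow described in the abstract (language identification per token, then dispatch to a language-specific tagger) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy lexicons, the tag names, the language labels "en", "bn" and "other", and the function names are hypothetical and are not taken from the paper, whose components would be trained models rather than lookups. How the two taggers' outputs are combined afterwards is not specified in the text above, so the sketch simply keeps the tags in token order.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicons standing in for real language-specific taggers.
ENGLISH_POS: Dict[str, str] = {
    "i": "PRON", "love": "VERB", "this": "DET", "movie": "NOUN",
}
# Romanized Bengali entries (illustrative only).
BENGALI_POS: Dict[str, str] = {
    "ami": "PRON", "khub": "ADV", "bhalo": "ADJ", "lagche": "VERB",
}


def identify_language(token: str) -> str:
    """Toy language identifier: lexicon lookup, English checked first."""
    word = token.lower()
    if word in ENGLISH_POS:
        return "en"
    if word in BENGALI_POS:
        return "bn"
    return "other"  # hashtags, emoji, unknown words, etc.


def tag_english(token: str) -> str:
    """Stand-in for an English POS tagger; 'X' marks an unknown word."""
    return ENGLISH_POS.get(token.lower(), "X")


def tag_bengali(token: str) -> str:
    """Stand-in for a romanized-Bengali POS tagger; 'X' marks an unknown word."""
    return BENGALI_POS.get(token.lower(), "X")


def pos_tag_code_mixed(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """Return (token, language, POS tag) triples in the original token order."""
    tagged = []
    for token in tokens:
        lang = identify_language(token)
        if lang == "en":
            tag = tag_english(token)
        elif lang == "bn":
            tag = tag_bengali(token)
        else:
            tag = "X"  # no tagger is assumed for tokens of unknown language
        tagged.append((token, lang, tag))
    return tagged


if __name__ == "__main__":
    for triple in pos_tag_code_mixed("ami this movie khub love".split()):
        print(triple)
```

Keeping language identification separate from tagging means either tagger can be swapped for a stronger model without touching the rest of the pipeline, which is the practical appeal of a modular design.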
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression
