Development of POS tagger for English-Bengali Code-Mixed data

Tathagata Raha; Sainik Kumar Mahata; Dipankar Das; Sivaji Bandyopadhyay

arXiv:2007.14576 · cs.CL · July 30, 2020

TL;DR

This paper presents a modular POS tagging system for English-Bengali code-mixed social media texts, achieving 75.29% accuracy by combining language identification and language-specific POS taggers.

Contribution

It introduces a novel modular approach for POS tagging of code-mixed data, integrating language identification with separate POS taggers for each language.

Findings

01. Achieved 75.29% accuracy on code-mixed tweets.

02. Developed a modular system combining language identification and language-specific POS tagging.

03. Created a dataset of manually tagged code-mixed sentences for evaluation.

Abstract

Code-mixed texts are widespread nowadays due to the advent of social media. Because these texts combine two languages within a single sentence, they give rise to various research problems in Natural Language Processing. In this paper, we investigate one such problem, namely Part-of-Speech (POS) tagging of code-mixed texts. We have built a system that can POS-tag English-Bengali code-mixed data in which the Bengali words are written in Roman script. Our approach begins with the collection and cleaning of English-Bengali code-mixed tweets, which were used as a development dataset for building our system. The proposed system is modular: it first tags individual tokens with their respective languages and then passes them to separate POS taggers designed for each language (English and Bengali, in our case). Tags given by the two systems are later…
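The modular flow described in the abstract (identify each token's language, then route it to a language-specific tagger) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy lexicons, the tag labels, the function names, and the choice to return tags in original token order are hypothetical and are not the paper's actual components, models, or tagset.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy lexicons (word -> POS tag); illustrative only, not from the paper.
BENGALI_ROMAN: Dict[str, str] = {"ami": "PRON", "khub": "ADV", "bhalo": "ADJ"}
ENGLISH: Dict[str, str] = {"i": "PRON", "am": "VERB", "very": "ADV", "happy": "ADJ"}


def identify_language(token: str) -> str:
    """Toy language identifier: a lexicon lookup standing in for a real classifier."""
    word = token.lower()
    if word in BENGALI_ROMAN:
        return "bn"
    if word in ENGLISH:
        return "en"
    return "unk"


def tag_english(token: str) -> str:
    """Stand-in for an English POS tagger; unknown words get the tag 'X'."""
    return ENGLISH.get(token.lower(), "X")


def tag_bengali(token: str) -> str:
    """Stand-in for a Romanized-Bengali POS tagger; unknown words get the tag 'X'."""
    return BENGALI_ROMAN.get(token.lower(), "X")


# Routing table from language label to the tagger for that language.
TAGGERS: Dict[str, Callable[[str], str]] = {"en": tag_english, "bn": tag_bengali}


def tag_code_mixed(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """Return (token, language, POS) triples, keeping the original token order."""
    tagged = []
    for token in tokens:
        lang = identify_language(token)
        tagger = TAGGERS.get(lang)
        pos = tagger(token) if tagger is not None else "X"
        tagged.append((token, lang, pos))
    return tagged


if __name__ == "__main__":
    for token, lang, pos in tag_code_mixed(["ami", "khub", "happy"]):
        print(f"{token}\t{lang}\t{pos}")
```

Keeping the language identifier and the per-language taggers behind a small routing table means either stage could be swapped for a trained model without changing the rest of the sketch.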

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression