A POS Tagger for Code Mixed Indian Social Media Text - ICON-2016 NLP Tools Contest Entry from Surukam

Sree Harsha Ramesh; Raveena R Kumar

arXiv:1701.00066 · cs.CL · January 3, 2017 · 1 citation

Open Access

TL;DR

This paper presents a CRF-based POS tagger for code-mixed Indian social media text. It uses features such as character n-grams and language information, and achieves competitive results without relying on monolingual POS taggers.

Contribution

The paper introduces a CRF-based POS tagging approach for code-mixed Indian social media text that operates without monolingual POS taggers and performs comparably to the 2015 winning system.

Findings

01. Achieved an average F1-score of 76.45%.

02. Effective use of features like emoji, web addresses, and language info.

03. Comparable performance to the 2015 winning system.

Abstract

Building Part-of-Speech (POS) taggers for code-mixed Indian languages is a particularly challenging problem in computational linguistics due to a dearth of accurately annotated training corpora. ICON, as part of its NLP tools contest, has organized this challenge as a shared task for the second consecutive year to improve the state of the art. This paper describes the POS tagger built at Surukam to predict the coarse-grained and fine-grained POS tags for three language pairs (Bengali-English, Telugu-English and Hindi-English), with the text spanning three popular social media platforms: Facebook, WhatsApp and Twitter. We employed Conditional Random Fields as the sequence tagging algorithm and trained our model with sklearn-crfsuite, a thin wrapper around CRFsuite. The features we used include character n-grams, language information and patterns for emoji,…
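To make the feature description more concrete, here is a minimal Python sketch of per-token feature extraction in the style the abstract mentions: character n-grams, language information, and patterns for emoji and web addresses. It is not the authors' code. The regular expressions, the feature names, the language tags (`hi`, `en`, `univ`), and the example sentence are all assumptions chosen for illustration, and the emoticon pattern is only a rough stand-in for whatever emoji patterns the paper used.

```python
import re

# Hypothetical patterns: the paper's actual regular expressions are not shown
# in this abstract, so these are illustrative stand-ins only.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
EMOTICON_RE = re.compile(r"^[:;=8][\-o*']?[)(\]\[dDpP/\\|]$")


def char_ngrams(token, n_max=3):
    """Return prefix and suffix character n-grams of length 1..n_max."""
    feats = {}
    lowered = token.lower()
    for n in range(1, min(n_max, len(lowered)) + 1):
        feats[f"prefix{n}"] = lowered[:n]
        feats[f"suffix{n}"] = lowered[-n:]
    return feats


def token_features(sentence, i):
    """Build a feature dict for token i of a list of (token, lang) pairs."""
    token, lang = sentence[i]
    feats = {
        "bias": 1.0,
        "lower": token.lower(),
        "lang": lang,  # token-level language tag (assumed tag set)
        "is_url": bool(URL_RE.match(token)),
        "is_emoticon": bool(EMOTICON_RE.match(token)),
    }
    feats.update(char_ngrams(token))
    # Neighbouring language tags give the model context about code switches.
    if i > 0:
        feats["prev_lang"] = sentence[i - 1][1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        feats["next_lang"] = sentence[i + 1][1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(sentence):
    """Convert a sentence into the list of feature dicts, one per token."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Made-up Hindi-English example with assumed language tags.
example = [
    ("yaar", "hi"), ("this", "en"), ("movie", "en"), ("mast", "hi"),
    ("hai", "hi"), (":)", "univ"), ("http://t.co/abc", "univ"),
]
X = sentence_features(example)
```

A list of such per-sentence feature lists, paired with lists of gold POS tags, is the input shape that sklearn-crfsuite's `CRF` estimator accepts in `fit(X_train, y_train)`. The training settings the authors used are not stated in this abstract, so none are shown here.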

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis