A POS Tagger for Code Mixed Indian Social Media Text - ICON-2016 NLP Tools Contest Entry from Surukam
Sree Harsha Ramesh, Raveena R Kumar

TL;DR
This paper presents a CRF-based POS tagger for code-mixed Indian social media text. It uses features such as character n-grams and language information, and reaches an average F1-score of 76.45% without relying on monolingual POS taggers.
Contribution
The paper introduces a CRF-based POS tagging approach for code-mixed Indian social media text. It combines character n-grams, language information and token-pattern features, operates without monolingual POS taggers, and performs comparably to the 2015 winning system.
Findings
Achieved an average F1-score of 76.45%.
Effective use of features such as emoji patterns, web-address patterns, and language information.
Comparable performance to the 2015 winning system.
Abstract
Building Part-of-Speech (POS) taggers for code-mixed Indian languages is a particularly challenging problem in computational linguistics due to a dearth of accurately annotated training corpora. ICON, as part of its NLP tools contest, has organized this challenge as a shared task for the second consecutive year to improve the state of the art. This paper describes the POS tagger built at Surukam to predict the coarse-grained and fine-grained POS tags for three language pairs (Bengali-English, Telugu-English and Hindi-English), with the text spanning three popular social media platforms: Facebook, WhatsApp and Twitter. We employed Conditional Random Fields as the sequence tagging algorithm and trained our model with sklearn-crfsuite, a thin wrapper around CRFsuite. The features we used include character n-grams, language information and patterns for emoji,…
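The kinds of features named in the abstract can be illustrated with a minimal sketch of a per-token feature extractor. This is not the authors' code: the regular expressions, the n-gram length, the context features, the feature names, and the language tags ("hi", "en", "univ") are all assumptions made for illustration, since this summary does not give those details.

```python
import re

# Hypothetical patterns: the paper's actual expressions are not given in this summary.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
EMOTICON_RE = re.compile(r"^[:;=8][\-o\*']?[\)\]\(\[dDpP/\\|]+$")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def char_ngrams(token, n_max=3):
    """Return prefix and suffix character n-grams up to length n_max (assumed value)."""
    feats = {}
    for n in range(1, min(n_max, len(token)) + 1):
        feats[f"prefix{n}"] = token[:n]
        feats[f"suffix{n}"] = token[-n:]
    return feats


def token_features(sentence, i):
    """Build a feature dict for the i-th (token, language_tag) pair of a sentence."""
    token, lang = sentence[i]
    feats = {
        "bias": 1.0,
        "lower": token.lower(),
        "lang": lang,  # language information for the current token
        "is_url": bool(URL_RE.match(token)),
        "is_emoji": bool(EMOJI_RE.search(token) or EMOTICON_RE.match(token)),
    }
    feats.update(char_ngrams(token.lower()))

    # Neighbouring language tags as context (an assumption, not stated in the source).
    if i > 0:
        feats["prev_lang"] = sentence[i - 1][1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        feats["next_lang"] = sentence[i + 1][1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(sentence):
    """Convert a sentence into a list of feature dicts, one per token."""
    return [token_features(sentence, i) for i in range(len(sentence))]


if __name__ == "__main__":
    example = [("yeh", "hi"), ("movie", "en"), ("www.example.com", "univ"), (":)", "univ")]
    for feats in sentence_features(example):
        print(feats)
```

A list of such per-sentence feature lists, paired with a parallel list of tag sequences, is the input shape that the `fit` method of sklearn-crfsuite's `CRF` estimator accepts, so one model could be trained per language pair and tag granularity under these assumptions.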
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
