Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance

Soumil Mandal; Karthick Nanmaran

arXiv:1805.08701 · cs.CL · May 23, 2018 · 1 cites

Open Access

TL;DR

This paper introduces a novel seq2seq-based architecture for normalizing phonetic typing variations in code-mixed social media data. In some cases it can also be used for back-transliteration and word identification. The model achieved 90.27% accuracy on the test data.

Contribution

The work presents a new model that effectively normalizes transliterated words in code-mixed data, addressing phonetic spelling variations and enhancing NLP tools for social media analysis.

Findings

01

Achieved 90.27% accuracy on test data.

02

Model effectively normalizes phonetic spelling variations.

03

Can also be used for back-transliteration and word identification in some cases.

Abstract

Building tools for code-mixed data is rapidly gaining popularity in the NLP research community as such data is rising exponentially on social media. Working with code-mixed data poses several challenges, especially due to grammatical inconsistencies and spelling variations, in addition to all the previously known challenges for social media scenarios. In this article, we present a novel architecture focusing on normalizing phonetic typing variations, which are commonly seen in code-mixed data. One of the main features of our architecture is that, in addition to normalizing, it can also be utilized for back-transliteration and word identification in some cases. Our model achieved an accuracy of 90.27% on the test data.
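
The title names Levenshtein distance as one component, but the abstract does not say how it is combined with the seq2seq model. As an illustration of that component only, the sketch below computes the edit distance between two strings and uses it to map a spelling variant to its nearest entry in a word list. The lexicon, the example words, and the `closest_word` helper are hypothetical and are not taken from the paper; this is not the authors' pipeline.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where each insertion, deletion, or substitution costs 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_word(variant: str, lexicon: list[str]) -> str:
    """Return the lexicon entry nearest to `variant` (ties broken alphabetically)."""
    return min(lexicon, key=lambda w: (levenshtein(variant, w), w))


# Hypothetical lexicon of canonical romanized spellings (an assumption for this demo).
LEXICON = ["khub", "bhalo", "kemon"]

if __name__ == "__main__":
    for variant in ["bhlo", "khoob"]:
        print(variant, "->", closest_word(variant, LEXICON))
```

A nearest-neighbour lookup like this only works when the correct target is already in the word list, which may be one reason to pair it with a generative model rather than rely on it alone.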

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling