PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation

Vivek Srivastava; Mayank Singh

arXiv:2004.09447·cs.CL·April 21, 2020·20 cites

TL;DR

This paper introduces PHINC, a parallel corpus of 13,738 English-Hindi code-mixed social media sentences with manual English translations, aimed at advancing machine translation research on informal, multilingual text.

Contribution

It provides one of the first extensive parallel corpora for English-Hindi code-mixed data, enabling improved machine translation models for social media content.

Findings

01

Corpus contains 13,738 manually translated code-mixed sentences.

02

Facilitates future research in code-mixed machine translation.

03

Supports development of better NLP tools for social media languages.

Abstract

Code-mixing is the phenomenon of using more than one language in a sentence. It is a very frequently observed pattern of communication on social media platforms. The flexibility to use multiple languages in one text message might help to communicate efficiently with the target audience. However, it makes processing and understanding natural language considerably more challenging. This paper presents a parallel corpus of 13,738 code-mixed English-Hindi sentences and their corresponding translations in English. The translations are done manually by annotators. We are releasing the parallel corpus to facilitate future research opportunities in code-mixed machine translation. The annotated corpus is available at https://doi.org/10.5281/zenodo.3605597.
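As a rough illustration of what a parallel corpus of this kind looks like in code, the sketch below reads pairs of a code-mixed sentence and its English translation. The tab-separated layout, the `SentencePair` type, and the `read_pairs` helper are assumptions made for this example; the actual file format of the Zenodo release is not described here and may differ. The sample rows are made up and are not taken from PHINC.

```python
import csv
import io
from typing import List, NamedTuple


class SentencePair(NamedTuple):
    source: str  # code-mixed Hinglish sentence
    target: str  # English translation


def read_pairs(handle) -> List[SentencePair]:
    """Read tab-separated (source, target) rows, skipping malformed ones.

    Assumes one pair per line with exactly two columns; this layout is
    hypothetical and not confirmed for the released corpus.
    """
    pairs = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 2:
            continue  # not a (source, target) pair
        source, target = row[0].strip(), row[1].strip()
        if source and target:
            pairs.append(SentencePair(source, target))
    return pairs


# Made-up rows for illustration only; these are NOT taken from PHINC.
SAMPLE = (
    "yeh movie bahut achhi thi\tThis movie was very good\n"
    "kal office jaana hai\tI have to go to the office tomorrow\n"
    "malformed row without a tab\n"
)

pairs = read_pairs(io.StringIO(SAMPLE))
for pair in pairs:
    print(f"{pair.source} -> {pair.target}")
```

To read a real file instead of the in-memory sample, the same function could be given a file opened with `encoding="utf-8"` and `newline=""`, provided the file really follows the two-column layout assumed above.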

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Hate Speech and Cyberbullying Detection