PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation
Vivek Srivastava, Mayank Singh

TL;DR
This paper introduces PHINC, a large manually annotated parallel corpus of English-Hindi code-mixed social media sentences, aimed at advancing machine translation research in multilingual and informal language contexts.
Contribution
It provides one of the first extensive parallel corpora for English-Hindi code-mixed data, enabling improved machine translation models for social media content.
Findings
Corpus contains 13,738 manually translated code-mixed sentences.
Facilitates future research in code-mixed machine translation.
Supports development of better NLP tools for social media languages.
Abstract
Code-mixing is the phenomenon of using more than one language in a sentence. It is a very frequently observed pattern of communication on social media platforms. The flexibility to use multiple languages in one text message can help a writer communicate efficiently with the target audience. However, it also makes processing and understanding natural language considerably harder. This paper presents a parallel corpus of 13,738 code-mixed English-Hindi sentences and their corresponding English translations. The sentences were translated manually by annotators. We are releasing the parallel corpus to facilitate future research in code-mixed machine translation. The annotated corpus is available at https://doi.org/10.5281/zenodo.3605597.
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Hate Speech and Cyberbullying Detection
