TL;DR
This paper introduces a novel algorithm that uses dependency parsing to generate grammatically coherent code-switched sentences for English-Hindi, English-Marathi, and English-Kannada, addressing data scarcity in multilingual NLP.
Contribution
The paper presents a new dependency parsing-based method for creating realistic code-switched data, significantly increasing data volume from minimal input while maintaining grammaticality.
Findings
The algorithm generates grammatically sensible code-switched sentences.
The generated data improves performance over baselines on NLP tasks.
Qualitative metrics confirm the quality of synthetic code-switched data.
Abstract
Codeswitching has become one of the most common occurrences among multilingual speakers worldwide, especially in countries like India, which has around 23 official languages and around 300 million bilingual speakers. The scarcity of codeswitched data is a bottleneck in exploring this domain for various Natural Language Processing (NLP) tasks. We thus present a novel algorithm which harnesses the syntactic structure of English grammar to develop grammatically sensible codeswitched versions of English-Hindi, English-Marathi and English-Kannada data. Apart from maintaining grammatical sanity to a great extent, our methodology also guarantees abundant generation of data from a minuscule snapshot of given data. We use multiple datasets to showcase the capabilities of our algorithm, while at the same time we assess the quality of…
