Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages
Jivnesh Sandhan, Ayush Daksh, Om Adideva Paranjay, Laxmidhar Behera, and Pawan Goyal

TL;DR
Prabhupadavani introduces the first multilingual code-mixed speech translation dataset covering 25 languages, enabling research in code-mixed speech translation and related tasks, with applications in cultural and linguistic studies.
Contribution
It provides the first large-scale, multi-domain, code-mixed speech translation dataset for 25 languages, addressing a significant gap in NLP and speech translation research.
Findings
The dataset contains 94 hours of speech from 130+ speakers.
The speech is manually aligned with text in the target languages, which span ten language families.
It supports research in code-mixed speech translation and machine translation.
Abstract
Nowadays, interest in code-mixing has become ubiquitous in Natural Language Processing (NLP); however, not much attention has been given to this phenomenon in the Speech Translation (ST) task. This can be attributed solely to the lack of labelled data for the code-mixed ST task. Thus, we introduce Prabhupadavani, a multilingual code-mixed ST dataset for 25 languages. It is multi-domain, covers ten language families, and contains 94 hours of speech by 130+ speakers, manually aligned with corresponding text in the target language. Prabhupadavani is about Vedic culture and heritage from Indic literature, where code-switching in the case of quotation from literature is important in the context of humanities teaching. To the best of our knowledge, Prabhupadavani is the first multilingual code-mixed ST dataset available in the ST literature. This data can also be used for a…
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Topic Modeling
