Oral to Web: Digitizing 'Zero Resource' Languages of Bangladesh
Mohammad Mamun Or Rashid

TL;DR
This paper introduces the Multilingual Cloud Corpus, a comprehensive digital dataset of Bangladesh's minority languages, enabling research in endangered language documentation and low-resource NLP.
Contribution
It presents the first large-scale, multimodal, cross-family linguistic dataset for Bangladesh's minority languages, including systematic fieldwork, data collection, and public accessibility.
Findings
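The dataset includes 85,792 textual entries and approximately 107 hours of transcribed audio recordings.
It covers 42 language varieties from four language families (Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian), plus two genetically unclassified languages.
It supports endangered language preservation and low-resource NLP research.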
Abstract
We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh's ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally "zero resource" varieties, 14 of which are classified as endangered. Our corpus comprises 85,792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately 107 hours of transcribed audio recordings, covering 42 language varieties from the Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian families, plus two genetically unclassified languages. The data were collected through systematic fieldwork over 90 days across nine districts of…
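As an illustration only, the entry structure described in the abstract (a Bengali stimulus text, an English translation, and an IPA transcription) could be modeled roughly as follows. This is a minimal sketch, not the corpus's actual schema: the class name `CorpusEntry`, all field names, the `language_variety` and `audio_path` fields, and the `is_complete` helper are assumptions introduced here, and the example values are placeholders rather than real corpus data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CorpusEntry:
    """Hypothetical record for one textual entry; field names are assumptions."""

    language_variety: str       # assumed label for one of the 42 varieties
    bengali_stimulus: str       # Bengali stimulus text shown to the speaker
    english_translation: str    # English translation of the stimulus
    ipa_transcription: str      # IPA transcription of the elicited response
    audio_path: Optional[str] = None  # assumed optional link to a recording

    def is_complete(self) -> bool:
        """Return True when all three text fields named in the abstract are non-empty."""
        fields = (
            self.bengali_stimulus,
            self.english_translation,
            self.ipa_transcription,
        )
        return all(text.strip() for text in fields)


# Placeholder values only; these are not entries from the corpus.
example = CorpusEntry(
    language_variety="<variety name>",
    bengali_stimulus="<Bengali stimulus text>",
    english_translation="<English translation>",
    ipa_transcription="<IPA transcription>",
)
```

A completeness check of this kind is one simple way to validate parallel entries before using them for low-resource NLP experiments; the published dataset may organize its fields differently.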
Taxonomy
Topics: Language and cultural evolution · Natural Language Processing Techniques · Multilingual Education and Policy
