Task-Lens: Cross-Task Utility Based Speech Dataset Profiling for Low-Resource Indian Languages
Swati Sharma, Divya V. Sharma, Anubha Gupta

TL;DR
Task-Lens profiles 50 Indian speech datasets spanning 26 languages across nine downstream speech tasks, revealing untapped potential and gaps, to improve resource utilization and guide future dataset creation for low-resource languages.
Contribution
Task-Lens introduces a cross-task survey methodology for Indian speech datasets, assessing their suitability for multiple downstream speech tasks and identifying underserved languages and tasks.
Findings
Many datasets contain metadata useful for multiple tasks.
Cross-task linkages can be leveraged to improve dataset utility.
Critical resource gaps remain for certain languages and tasks.
Abstract
The rising demand for inclusive speech technologies amplifies the need for multilingual datasets for Natural Language Processing (NLP) research. However, limited awareness of existing task-specific resources in low-resource languages hinders research. This challenge is especially acute in linguistically diverse countries, such as India. Cross-task profiling of existing Indian speech datasets can alleviate the data scarcity challenge. This involves investigating the utility of datasets across multiple downstream tasks rather than focusing on a single task. Prior surveys typically catalogue datasets for a single task, leaving comprehensive cross-task profiling as an open opportunity. Therefore, we propose Task-Lens, a cross-task survey that assesses the readiness of 50 Indian speech datasets spanning 26 languages for nine downstream speech tasks. First, we analyze which datasets contain…
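The idea of cross-task profiling described above can be illustrated with a minimal sketch. It is not the authors' method: the task names, the metadata fields each task requires, and the two example corpora are all hypothetical, chosen only to show how one dataset's metadata might qualify it for several tasks and how per-language gaps could then be read off.

```python
# Hypothetical mapping from a task to the metadata fields it needs.
# These task names and fields are illustrative assumptions, not the
# nine tasks or the criteria used in the paper.
TASK_REQUIREMENTS = {
    "asr": {"audio", "transcript"},
    "speaker_id": {"audio", "speaker_id"},
    "language_id": {"audio", "language"},
}


def ready_tasks(fields):
    """Return the tasks whose required fields are all present."""
    available = set(fields)
    return sorted(
        task for task, required in TASK_REQUIREMENTS.items()
        if required <= available
    )


def task_gaps(datasets):
    """For each language, list the tasks that no dataset supports."""
    covered = {}
    for ds in datasets:
        tasks = ready_tasks(ds["fields"])
        for lang in ds["languages"]:
            covered.setdefault(lang, set()).update(tasks)
    return {
        lang: sorted(set(TASK_REQUIREMENTS) - done)
        for lang, done in covered.items()
    }


# Made-up corpora, for illustration only.
datasets = [
    {"name": "corpus_a", "languages": ["hi"],
     "fields": ["audio", "transcript", "speaker_id"]},
    {"name": "corpus_b", "languages": ["hi", "ta"],
     "fields": ["audio", "language"]},
]

gaps = task_gaps(datasets)
```

In this toy setup, `corpus_a` is usable for two tasks even if it was collected for only one, and `gaps` lists, per language, the tasks that none of the listed corpora can support.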
Taxonomy
Topics: ICT in Developing Communities · Natural Language Processing Techniques · Speech Recognition and Synthesis
