Task-Lens: Cross-Task Utility Based Speech Dataset Profiling for Low-Resource Indian Languages

Swati Sharma; Divya V. Sharma; Anubha Gupta

arXiv:2602.23388 · cs.CL · March 2, 2026

TL;DR

Task-Lens profiles 50 Indian speech datasets spanning 26 languages for their readiness across nine downstream speech tasks, revealing untapped potential and gaps in order to improve resource utilization and guide future dataset creation for low-resource languages.

Contribution

Task-Lens introduces a cross-task survey methodology for Indian speech datasets that assesses their suitability for multiple downstream speech tasks and identifies underserved languages and tasks.

Findings

01

Many datasets contain metadata useful for multiple tasks.

02

Cross-task linkages can be leveraged to improve dataset utility.

03

The survey identifies critical resource gaps for certain languages and tasks.

Abstract

The rising demand for inclusive speech technologies amplifies the need for multilingual datasets for Natural Language Processing (NLP) research. However, limited awareness of existing task-specific resources in low-resource languages hinders research. This challenge is especially acute in linguistically diverse countries, such as India. Cross-task profiling of existing Indian speech datasets can alleviate the data scarcity challenge. This involves investigating the utility of datasets across multiple downstream tasks rather than focusing on a single task. Prior surveys typically catalogue datasets for a single task, leaving comprehensive cross-task profiling as an open opportunity. Therefore, we propose Task-Lens, a cross-task survey that assesses the readiness of 50 Indian speech datasets spanning 26 languages for nine downstream speech tasks. First, we analyze which datasets contain…
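The idea of checking one dataset against several tasks can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the task names, the metadata field names, the `TASK_REQUIREMENTS` mapping, and the helper functions are hypothetical and are not taken from the paper, whose actual criteria are not given in this summary. The sketch treats a dataset as "ready" for a task when its metadata contains every field that task is assumed to need.

```python
from dataclasses import dataclass

# Hypothetical mapping from task to the metadata fields it is assumed to need.
# These task and field names are placeholders, not the paper's criteria.
TASK_REQUIREMENTS = {
    "asr": {"transcript"},
    "language_id": {"language"},
    "speaker_id": {"speaker_id"},
}


@dataclass(frozen=True)
class DatasetProfile:
    """A dataset name plus the set of metadata fields it provides."""
    name: str
    metadata_fields: frozenset


def ready_tasks(profile, requirements=TASK_REQUIREMENTS):
    """Return the tasks whose required fields are all present in the profile."""
    return sorted(
        task
        for task, needed in requirements.items()
        if needed <= profile.metadata_fields  # subset test
    )


def uncovered_tasks(profiles, requirements=TASK_REQUIREMENTS):
    """Return the tasks that no profiled dataset is ready for."""
    covered = set()
    for profile in profiles:
        covered.update(ready_tasks(profile, requirements))
    return sorted(set(requirements) - covered)


# Hypothetical example dataset with transcripts and language labels only.
example = DatasetProfile("example_corpus", frozenset({"transcript", "language"}))
print(ready_tasks(example))
print(uncovered_tasks([example]))
```

Under these assumed requirements, a dataset collected for one task can surface as usable for others, and tasks left uncovered by every profiled dataset point to gaps that new data collection could target.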

Taxonomy

Topics: ICT in Developing Communities · Natural Language Processing Techniques · Speech Recognition and Synthesis