NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data
Tahir Javed, Kaushal Bhogale, Mitesh M. Khapra

TL;DR
Nirantar is a new framework for evaluating continual learning in multilingual, multi-domain speech recognition using real-world, incrementally collected data across diverse languages and domains, highlighting challenges and gaps in current methods.
Contribution
We introduce Nirantar, a comprehensive, real-world benchmark for continual learning in multilingual and multi-domain ASR, with dynamic data shifts and extensive speech data from India.
Findings
Existing CL methods lack consistent performance.
Nirantar enables systematic benchmarking of CL approaches.
Real-world data presents unique challenges for CL in ASR.
Abstract
We introduce Nirantar, a comprehensive framework for evaluating continual learning (CL) in multilingual and multi-domain ASR. Designed to reflect real-world CL challenges, Nirantar leverages data collected incrementally across 22 languages and 208 districts in India through natural episodes. This enables evaluation across Language-Incremental (LIL), Domain-Incremental (DIL), and the novel Language-Incremental Domain-Incremental Learning (LIDIL) scenarios. Unlike prior work that relies on simulated episodes, Nirantar presents dynamic, non-uniform language and domain shifts, making it an ideal testbed for CL research. With 3250 hours of human-transcribed speech, including 1720 hours newly introduced in this work, our framework enables systematic benchmarking of CL methods. We evaluate existing approaches and demonstrate that no single method performs consistently well, underscoring the…
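As a rough illustration of how the three scenarios named in the abstract differ, the sketch below labels an incoming episode by whether it brings unseen languages, unseen domains, or both. It is not the authors' code: the function name `classify_episode`, the representation of data as `(language, domain)` pairs, and the per-episode labeling rule are all assumptions made for this example.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (language, domain); both are opaque labels in this sketch


def classify_episode(seen: Set[Pair], episode: Iterable[Pair]) -> str:
    """Label an episode by the kind of shift it introduces (hypothetical rule).

    `seen` holds every (language, domain) pair from earlier episodes.
    """
    episode = set(episode)
    seen_languages = {lang for lang, _ in seen}
    seen_domains = {dom for _, dom in seen}

    new_languages = {lang for lang, _ in episode} - seen_languages
    new_domains = {dom for _, dom in episode} - seen_domains

    if new_languages and new_domains:
        return "LIDIL"  # both a new language and a new domain arrive
    if new_languages:
        return "LIL"    # only new languages arrive
    if new_domains:
        return "DIL"    # only new domains arrive
    return "none"       # nothing unseen on either axis


# Example with made-up labels: one language and one domain seen so far.
history = {("hindi", "district_a")}
print(classify_episode(history, [("tamil", "district_a")]))
print(classify_episode(history, [("hindi", "district_b")]))
print(classify_episode(history, [("tamil", "district_b")]))
```

With naturally collected episodes, the label can change from one episode to the next, which is one way to read the abstract's "dynamic, non-uniform" shifts.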
Peer Reviews
Decision: Submitted to ICLR 2025
- The dataset encompasses a diverse range of languages across India.
- They offer three distinct continual learning settings: Language-Incremental (LIL), Domain-Incremental (DIL), and the novel Language-Incremental Domain-Incremental Learning (LIDIL).
- The paper is well-written and easy to follow, with a comprehensive analysis included.
- The first two settings, Language-Incremental and Domain-Incremental, have been explored in previous studies, including CL-mARS, which covers various CL methods in multilingual contexts (https://arxiv.org/pdf/2310.16931), and works by Sadhu et al. (https://www.isca-archive.org/interspeech_2020/sadhu20_interspeech.pdf), Chang et al. (https://arxiv.org/abs/2104.01616), and Li et al. (https://arxiv.org/abs/2302.01496), which investigate incremental domain setups. In light of these prior studies, t
- **Novel Contribution:** The work is original, contributing not only newly collected data but also a novel continual learning scenario for ASR: Language-Incremental Domain-Incremental Learning (LIDIL). This scenario provides a more realistic view of real-world multilingual and multi-domain environments than many CL datasets that rely on synthetic or structured data, making its impact relevant beyond the specific languages covered.
- **Comprehensive Evaluation:** The evaluation is extensive, cover
- **Limited Baselines:** While the paper evaluates CL methods from replay-based and regularization-based approaches, it excludes architecture-based methods arguing they are impractical for real-world settings as they add parameters for each new language and domain, leading to excessive complexity as episodes increase (lines 328-332). However, this overlooks recent advances in lightweight architecture-based methods like language-specific adapters, which have improved ASR performance even in domai
The dataset’s design closely mirrors real-world CL requirements, providing a valuable resource for research on multilingual, multi-domain, and incremental learning. The inclusion of both widely used CL scenarios (LIL, DIL) and the novel LIDIL scenario represents a significant contribution, opening new avenues for research. Evaluations of CL approaches provide useful initial insights, underscoring the complexity and variability of performance across the dataset’s scenarios.
Figures 3-5 in the report show fluctuating trends, which might indicate an imbalance in the amount of data available for each language. This imbalance could impact the model's performance on specific low-resource languages, potentially limiting its effectiveness in some areas. While the evaluations provide a starting point, a deeper exploration of algorithmic adaptations or optimizations specifically for LIDIL could enhance the study’s practical impact.
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Interpreting and Communication in Healthcare · Speech Recognition and Synthesis
