Wavelet Scattering Transform for Improving Generalization in Low-Resourced Spoken Language Identification
Spandan Dey, Premjeet Singh, Goutam Saha

TL;DR
This paper explores wavelet scattering transform (WST) features as a way to improve the generalization of low-resourced spoken language identification (LID) systems. WST outperforms MFCC in both same-corpora and blind VoxLingua107 evaluations.
Contribution
It introduces WST as a novel feature for LID, optimizes its parameters for South Asian languages, and develops fused ECAPA-TDNN systems to improve performance on unseen data.
Findings
WST features reduce EER by up to 14.05% in same-corpora evaluations.
WST improves generalization in blind VoxLingua107 evaluations.
Optimal WST hyper-parameters depend on both training and testing datasets.
Abstract
Commonly used features in spoken language identification (LID), such as mel-spectrograms or MFCCs, lose high-frequency information due to windowing, and the loss increases further for longer temporal contexts. To improve the generalization of low-resourced LID systems, we investigate an alternative feature representation, the wavelet scattering transform (WST), which compensates for these shortcomings. To our knowledge, WST has not previously been explored for LID tasks. We first optimize WST features for multiple South Asian LID corpora. We show that LID requires low octave resolution and that frequency scattering is not useful. Further, cross-corpora evaluations show that the optimal WST hyper-parameters depend on both the train and test corpora. Hence, we develop fused ECAPA-TDNN-based LID systems with different sets of WST hyper-parameters to improve generalization to unknown data. Compared to MFCC, EER is reduced…
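The abstract reports results as equal error rate (EER) and mentions fusing systems built with different WST hyper-parameters. As a rough illustration only, the sketch below approximates EER by sweeping a threshold over detection scores and fuses systems by an equal-weight average of their scores. The function names `equal_error_rate` and `fuse_scores` are hypothetical, and the equal-weight average is an assumption made here for illustration; the text above does not say how the paper's systems are fused or how EER is aggregated across languages.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping the threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    Returns the mean of the false-acceptance and false-rejection rates
    at the threshold where the two rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def fuse_scores(per_system_scores):
    """Equal-weight score fusion (an assumption, not the paper's stated method).

    per_system_scores: one list of trial scores per system, all the same length.
    Returns the per-trial average across systems.
    """
    n_systems = len(per_system_scores)
    return [sum(trial) / n_systems for trial in zip(*per_system_scores)]


if __name__ == "__main__":
    targets = [0.9, 0.8, 0.7, 0.4]
    nontargets = [0.6, 0.3, 0.2, 0.1]
    print(equal_error_rate(targets, nontargets))
    print(fuse_scores([[0.25, 1.0], [0.75, 0.5]]))
```

In practice the scores being fused would come from systems trained on different WST configurations, and the EER would be computed on the fused scores.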
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
