Language-Universal Speech Attributes Modeling for Zero-Shot Multilingual Spoken Keyword Recognition
Hao Yen, Pin-Jui Ku, Sabato Marco Siniscalchi, Chin-Hui Lee

TL;DR
This paper introduces a language-universal speech attribute modeling approach for zero-shot multilingual spoken keyword recognition (SKR). It combines self-supervised speech representations with universal speech attributes (manner and place of articulation) to improve recognition accuracy across languages.
Contribution
A framework that combines self-supervised speech representations, universal speech attributes, and domain adversarial training (DAT) to improve multilingual spoken keyword recognition (SKR), especially in zero-shot scenarios.
Findings
Achieves comparable performance to character- and phoneme-based SKR in seen languages.
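With domain adversarial training (DAT), outperforms character- and phoneme-based SKR, with 13.73% and 17.22% relative word error rate (WER) reductions in seen languages, and also reduces WER in unseen languages.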
Domain adversarial training enhances the robustness and accuracy of the proposed model.
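The WER reductions reported for this paper are relative, not absolute. As a minimal sketch of how such a figure is computed, the helper below divides the drop in WER by the baseline WER. The function name and the example WERs are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Return the relative WER reduction in percent.

    This is the share of the baseline's error that the proposed system removes.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical WERs in percent, NOT results from the paper.
baseline = 10.0
proposed = 8.0
print(relative_wer_reduction(baseline, proposed))  # 20.0
```

An absolute drop of 2 points from a 10% baseline is therefore a 20% relative reduction, which is why relative figures look larger than the underlying change in WER.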
Abstract
We propose a novel language-universal approach to end-to-end automatic spoken keyword recognition (SKR) that leverages (i) a self-supervised pre-trained model, and (ii) a set of universal speech attributes (manner and place of articulation). Specifically, Wav2Vec2.0 is used to generate robust speech representations, followed by a linear output layer to produce attribute sequences. A non-trainable pronunciation model then maps sequences of attributes into spoken keywords in a multilingual setting. Experiments on the Multilingual Spoken Words Corpus show performance comparable to that of character- and phoneme-based SKR in seen languages. The inclusion of domain adversarial training (DAT) improves the proposed framework, outperforming both character- and phoneme-based SKR approaches with 13.73% and 17.22% relative word error rate (WER) reduction in seen languages, and achieves 32.14% and 19.92%…
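The last stage of the pipeline described in the abstract, a non-trainable pronunciation model that maps attribute sequences to keywords, can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the abstract does not specify the decoding scheme, so the code assumes frame-level labels with a CTC-style blank, a hypothetical lexicon of (manner, place) pairs, and nearest-match lookup by edit distance. It is not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple

Attr = Tuple[str, str]  # (manner, place) pair; this pairing is an assumption
BLANK = "<blank>"       # CTC-style blank symbol (assumption)

# Hypothetical lexicon: keyword -> attribute sequence. The attribute labels
# are shared across languages, so supporting a keyword from a new language
# only needs a new entry here, with no retraining.
LEXICON: Dict[str, List[Attr]] = {
    "no": [("nasal", "alveolar"), ("vowel", "back")],
    "si": [("fricative", "alveolar"), ("vowel", "front")],
    "go": [("stop", "velar"), ("vowel", "back")],
}


def collapse(frames: Sequence[object]) -> List[object]:
    """Merge repeated frame labels and drop blanks (greedy CTC collapse)."""
    out: List[object] = []
    prev = None
    for label in frames:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out


def edit_distance(a: Sequence[object], b: Sequence[object]) -> int:
    """Levenshtein distance between two label sequences."""
    prev_row = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        row = [i]
        for j, y in enumerate(b, 1):
            row.append(min(prev_row[j] + 1,              # deletion
                           row[j - 1] + 1,               # insertion
                           prev_row[j - 1] + (x != y)))  # substitution
        prev_row = row
    return prev_row[-1]


def recognize(frames: Sequence[object]) -> str:
    """Return the lexicon keyword whose attribute sequence is closest."""
    attrs = collapse(frames)
    return min(LEXICON, key=lambda word: edit_distance(attrs, LEXICON[word]))


# Hypothetical frame-level output of the attribute classifier.
frames = [BLANK, ("nasal", "alveolar"), ("nasal", "alveolar"),
          BLANK, ("vowel", "back"), ("vowel", "back")]
print(recognize(frames))  # no
```

The lookup step has no trainable parameters, which matches the "non-trainable" description in the abstract. The edit-distance matching is only one plausible way to realize it.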
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Sparse Evolutionary Training
