Language-Universal Speech Attributes Modeling for Zero-Shot Multilingual Spoken Keyword Recognition

Hao Yen; Pin-Jui Ku; Sabato Marco Siniscalchi; Chin-Hui Lee

arXiv:2406.02488·eess.AS·June 5, 2024

TL;DR

This paper introduces a language-universal speech attribute modeling approach for zero-shot multilingual spoken keyword recognition, utilizing self-supervised representations and universal speech attributes to improve recognition accuracy across multiple languages.

Contribution

It presents a novel framework combining self-supervised speech representations with universal speech attributes and domain adversarial training for improved multilingual SKR, especially in zero-shot scenarios.

Findings

01 Achieves performance comparable to character- and phoneme-based SKR in seen languages.

02 With domain adversarial training, outperforms character- and phoneme-based SKR, with 13.73% and 17.22% relative WER reduction in seen languages and further reductions in unseen languages.

03 Domain adversarial training enhances the robustness and accuracy of the proposed model.
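
Domain adversarial training is commonly realized with a gradient reversal step: a domain classifier learns to predict the domain, while the shared encoder receives the negated, scaled domain gradient so that its features become less domain-specific. The toy sketch below illustrates only that update rule with hand-derived gradients on scalar parameters. It is not the paper's implementation: the squared-error losses, the scalar "encoder" `w`, the heads `a` and `b`, the weight `lam`, and the reading of "domain" as language identity are all assumptions made for illustration.

```python
def dat_gradients(w, a, b, x, y, d, lam):
    """Gradients for one toy DAT step (all names here are hypothetical).

    w: shared encoder weight, a: task head weight, b: domain head weight,
    x: input, y: task target, d: domain target, lam: reversal strength.
    """
    f = w * x                       # shared feature from the "encoder"
    task_err = a * f - y            # residual of the task head
    dom_err = b * f - d             # residual of the domain head

    # Ordinary gradients of the two squared-error losses.
    g_a = 2 * task_err * f          # d(task loss)/da
    g_b = 2 * dom_err * f           # d(domain loss)/db
    g_w_task = 2 * task_err * a * x # d(task loss)/dw
    g_w_dom = 2 * dom_err * b * x   # d(domain loss)/dw

    # Gradient reversal: the encoder descends the task loss but ASCENDS
    # the domain loss, scaled by lam. The domain head itself still
    # descends the domain loss through g_b.
    g_w = g_w_task - lam * g_w_dom
    return g_w, g_a, g_b


def sgd_step(params, grads, lr=0.1):
    """Plain gradient descent on a tuple of parameters."""
    return tuple(p - lr * g for p, g in zip(params, grads))
```

With `lam = 0` the encoder update reduces to ordinary task training; larger values push the encoder harder toward features the domain head cannot exploit.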

Abstract

We propose a novel language-universal approach to end-to-end automatic spoken keyword recognition (SKR) leveraging (i) a self-supervised pre-trained model, and (ii) a set of universal speech attributes (manner and place of articulation). Specifically, Wav2Vec2.0 is used to generate robust speech representations, followed by a linear output layer to produce attribute sequences. A non-trainable pronunciation model then maps sequences of attributes into spoken keywords in a multilingual setting. Experiments on the Multilingual Spoken Words Corpus show performance comparable to character- and phoneme-based SKR in seen languages. The inclusion of domain adversarial training (DAT) improves the proposed framework, outperforming both character- and phoneme-based SKR approaches with 13.73% and 17.22% relative word error rate (WER) reduction in seen languages, and achieves 32.14% and 19.92%…
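
The idea of a non-trainable pronunciation model that maps attribute sequences to keywords can be sketched as a fixed lexicon lookup. The example below is a minimal illustration under stated assumptions, not the authors' model: the attribute inventory `PHONE_TO_ATTR`, the toy keywords in `LEXICON`, the blank symbol `<b>`, the CTC-style collapsing of frame labels, and the edit-distance matching are all hypothetical choices. The frame-level labels stand in for the output of the Wav2Vec2.0 encoder and linear layer, which is not modeled here.

```python
from typing import Dict, List, Tuple

# Hypothetical inventory: phoneme -> (manner, place) of articulation.
PHONE_TO_ATTR: Dict[str, Tuple[str, str]] = {
    "k": ("stop", "velar"),
    "t": ("stop", "coronal"),
    "s": ("fricative", "coronal"),
    "n": ("nasal", "coronal"),
    "a": ("vowel", "low"),
    "i": ("vowel", "high"),
    "o": ("vowel", "mid"),
}

# Hypothetical toy keywords with made-up phoneme spellings.
LEXICON: Dict[str, List[str]] = {
    "kat": ["k", "a", "t"],
    "sin": ["s", "i", "n"],
    "no": ["n", "o"],
}


def collapse_frames(frames: List[str], blank: str = "<b>") -> List[str]:
    """Merge consecutive repeats and drop blanks (CTC-style, assumed)."""
    out, prev = [], None
    for label in frames:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


def keyword_to_attrs(word: str) -> Tuple[List[str], List[str]]:
    """Expand a keyword into its manner and place sequences via the lexicon."""
    pairs = [PHONE_TO_ATTR[p] for p in LEXICON[word]]
    return [m for m, _ in pairs], [pl for _, pl in pairs]


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two label sequences."""
    row = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        prev_diag, row[0] = row[0], i
        for j, y in enumerate(b, start=1):
            cur = min(row[j] + 1, row[j - 1] + 1, prev_diag + (x != y))
            prev_diag, row[j] = row[j], cur
    return row[-1]


def recognize(manner_frames: List[str], place_frames: List[str]) -> str:
    """Pick the keyword whose attribute sequences are closest to the input."""
    manner = collapse_frames(manner_frames)
    place = collapse_frames(place_frames)

    def cost(word: str) -> int:
        ref_manner, ref_place = keyword_to_attrs(word)
        return edit_distance(manner, ref_manner) + edit_distance(place, ref_place)

    return min(LEXICON, key=cost)


if __name__ == "__main__":
    manner = ["stop", "stop", "<b>", "vowel", "vowel", "<b>", "stop"]
    place = ["velar", "<b>", "low", "low", "<b>", "coronal", "coronal"]
    print(recognize(manner, place))
```

Because the lookup has no trainable parameters, adding a keyword in a new language only requires a lexicon entry spelled in the shared attribute inventory, which is one way to read the zero-shot claim above.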

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training