Accidental Learners: Spoken Language Identification in Multilingual Self-Supervised Models

Travis M. Bartley, Fei Jia, Krishna C. Puvvada, Samuel Kriman, and Boris Ginsburg

arXiv:2211.05103 · eess.AS · March 14, 2023 · 1 citation

Open Access

TL;DR

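This paper applies a Conformer-based architecture to multilingual self-supervised pre-training for spoken language identification. It finds that the pre-trained models encode language-discriminative information in lower layers, that embeddings from those layers classify unseen languages and new acoustic environments without additional training, and that a fine-tuned model reaches results similar to the current state of the art on VoxLingua107 with 5x fewer parameters.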

Contribution

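Extends earlier self-supervised approaches to language identification with a Conformer-based architecture in a multilingual pre-training setup, and shows that the resulting lower-layer embeddings are robust to unseen languages and acoustic environments while the fine-tuned model uses 5x fewer parameters than current state-of-the-art systems.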

Findings

01

Pre-trained models encode language-discriminative information in lower layers.

02

Embeddings are robust to unseen languages and acoustic variations.

03

Achieves results similar to current state-of-the-art systems with 5x fewer parameters.

Abstract

In this paper, we extend previous self-supervised approaches for language identification by experimenting with a Conformer-based architecture in a multilingual pre-training paradigm. We find that pre-trained speech models optimally encode language-discriminative information in lower layers. Further, we demonstrate that the embeddings obtained from these layers are robust enough to classify unseen languages and different acoustic environments without additional training. After fine-tuning a pre-trained Conformer model on the VoxLingua107 dataset, we achieve results similar to current state-of-the-art systems for language identification. Moreover, our model accomplishes this with 5x fewer parameters. We open-source the model through the NVIDIA NeMo toolkit.
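
The layer-wise finding in the abstract can be illustrated with a small probing sketch. This is not the authors' code, and the page does not say which probe or pooling they used. The sketch assumes that frame-level features for each utterance have already been extracted from one encoder layer as a (time, dim) array. It mean-pools them into a single utterance embedding and scores a nearest-centroid classifier, which is a simple stand-in chosen here for illustration. All function names are hypothetical, and extraction of features from the released NeMo model is deliberately left out.

```python
import numpy as np


def mean_pool(frames: np.ndarray) -> np.ndarray:
    """Collapse (time, dim) frame features into one utterance-level vector."""
    return frames.mean(axis=0)


def fit_centroids(embeddings: np.ndarray, labels: np.ndarray):
    """Compute one mean embedding (centroid) per language label."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(embeddings: np.ndarray, classes: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each embedding the label of its nearest centroid."""
    # Squared Euclidean distance from every embedding to every centroid.
    dists = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]


def probe_layer(train_feats, train_labels, test_feats, test_labels) -> float:
    """Accuracy of a nearest-centroid probe on features from a single layer.

    train_feats and test_feats are lists of (time, dim) arrays, one per
    utterance; utterances may differ in length.
    """
    train_emb = np.stack([mean_pool(f) for f in train_feats])
    test_emb = np.stack([mean_pool(f) for f in test_feats])
    classes, centroids = fit_centroids(train_emb, np.asarray(train_labels))
    predictions = predict(test_emb, classes, centroids)
    return float((predictions == np.asarray(test_labels)).mean())
```

Calling probe_layer once per encoder layer, on the same train and test utterances, gives one accuracy per layer. Comparing those accuracies is one way to check which layers carry the most language-discriminative information.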

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing