Accidental Learners: Spoken Language Identification in Multilingual Self-Supervised Models
Travis M. Bartley, Fei Jia, Krishna C. Puvvada, Samuel Kriman, and Boris Ginsburg

TL;DR
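This paper applies a Conformer-based architecture to multilingual self-supervised speech pre-training for spoken language identification. It shows that the pre-trained models encode language-discriminative information best in their lower layers, that embeddings from those layers can classify unseen languages and handle different acoustic environments without additional training, and that a model fine-tuned on VoxLingua107 performs on par with current state-of-the-art systems using 5x fewer parameters.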
Contribution
Extends earlier self-supervised approaches to language identification by using a Conformer-based architecture in a multilingual pre-training setup, and open-sources the resulting model through the NVIDIA NeMo toolkit.
Findings
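Pre-trained speech models encode language-discriminative information best in their lower layers.
Embeddings from these layers can classify unseen languages and handle different acoustic environments without additional training.
After fine-tuning on VoxLingua107, the Conformer model achieves results similar to current state-of-the-art systems with 5x fewer parameters.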
Abstract
In this paper, we extend previous self-supervised approaches for language identification by experimenting with a Conformer-based architecture in a multilingual pre-training paradigm. We find that pre-trained speech models optimally encode language-discriminative information in lower layers. Further, we demonstrate that the embeddings obtained from these layers are robust enough to classify unseen languages and handle different acoustic environments without additional training. After fine-tuning a pre-trained Conformer model on the VoxLingua107 dataset, we achieve results similar to current state-of-the-art systems for language identification. Moreover, our model accomplishes this with 5x fewer parameters. We open-source the model through the NVIDIA NeMo toolkit.
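The idea of comparing encoder layers by how well their embeddings separate languages can be illustrated with a small probe. The sketch below is not the authors' method and does not use the NeMo API: it assumes frame embeddings have already been extracted per layer, mean-pools them into utterance vectors, and classifies with a nearest-centroid rule under cosine similarity, so a new language only needs a few labelled utterances and no encoder training. The helper names, the language codes, and the synthetic_layer generator with its signal values are all hypothetical and exist only to exercise the code; they do not reproduce any result from the paper.

```python
import math
import random


def mean_pool(frames):
    """Average a list of frame vectors (time x dim) into one utterance vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fit_centroids(vectors, labels):
    """Average the utterance vectors of each language into one centroid."""
    grouped = {}
    for vec, lab in zip(vectors, labels):
        grouped.setdefault(lab, []).append(vec)
    return {lab: mean_pool(vecs) for lab, vecs in grouped.items()}


def predict(vec, centroids):
    """Return the language whose centroid is most similar to vec."""
    return max(centroids, key=lambda lab: cosine(vec, centroids[lab]))


def probe_accuracy(train, test):
    """train/test: lists of (utterance_vector, language) pairs from one layer."""
    centroids = fit_centroids([v for v, _ in train], [l for _, l in train])
    hits = sum(predict(v, centroids) == lab for v, lab in test)
    return hits / len(test)


def synthetic_layer(rng, signal, n_per_lang=20, n_frames=10, dim=6):
    """Hypothetical stand-in for one encoder layer's output.

    `signal` controls how strongly language identity shows up in the
    frames; real embeddings would come from a pre-trained model instead.
    """
    data = []
    for lang_idx, lang in enumerate(["en", "de", "hi"]):
        for _ in range(n_per_lang):
            frames = [
                [rng.gauss(0.0, 1.0) + (signal if d == lang_idx else 0.0)
                 for d in range(dim)]
                for _ in range(n_frames)
            ]
            data.append((mean_pool(frames), lang))
    return data


# Toy comparison: one layer with a strong language signal, one with none.
rng = random.Random(0)
accuracy = {}
for name, signal in {"lower": 5.0, "upper": 0.0}.items():
    train = synthetic_layer(rng, signal)
    test = synthetic_layer(rng, signal)
    accuracy[name] = probe_accuracy(train, test)
print(accuracy)
```

With real data, each entry of the loop would hold embeddings taken from a different encoder layer, and the layer with the highest probe accuracy would be the candidate for downstream language identification.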
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
