Pretraining Approaches for Spoken Language Recognition: TalTech Submission to the OLR 2021 Challenge
Tanel Alumäe, Kunnar Kukk

TL;DR
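This paper compares pretraining strategies for spoken language recognition in the two tracks of the OLR 2021 Challenge: multilingual ASR pretraining for the constrained track, and a pretrained wav2vec2.0 model with external data for the unconstrained track. It also analyzes factors that influence model performance, such as the amount of target language data.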
Contribution
It combines two approaches to language identification: finetuning the shared encoder of a multilingual ASR model (constrained track) and finetuning the pretrained XLSR-53 wav2vec2.0 model (unconstrained track), achieving top rankings in the OLR 2021 Challenge.
Findings
Pretraining on multilingual ASR and then finetuning the encoder improves language recognition accuracy.
Pretrained wav2vec2.0 models enhance performance with external data.
Target language data quantity impacts backend model accuracy.
Abstract
This paper investigates different pretraining approaches to spoken language identification. The paper is based on our submission to the Oriental Language Recognition 2021 Challenge. We participated in two tracks of the challenge: constrained and unconstrained language recognition. For the constrained track, we first trained a Conformer-based encoder-decoder model for multilingual automatic speech recognition (ASR), using the provided training data that had transcripts available. The shared encoder of the multilingual ASR model was then finetuned for the language identification task. For the unconstrained task, we relied on both externally available pretrained models as well as external data: the multilingual XLSR-53 wav2vec2.0 model was finetuned on the VoxLingua107 corpus for the language recognition task, and finally finetuned on the provided target language training data, augmented…
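Both tracks described in the abstract reuse a pretrained encoder and finetune it for language identification. The following is a minimal sketch of what an utterance-level classification head over frame-level encoder outputs could look like. Mean pooling, the single linear layer, and every name and dimension here (`mean_pool`, `language_posteriors`, `ENCODER_DIM`, `NUM_LANGUAGES`) are assumptions for illustration and are not taken from the paper; the frames are random stand-ins for real encoder outputs.

```python
import math
import random


def mean_pool(frames):
    """Average frame-level vectors into one utterance-level vector."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def language_posteriors(frames, weights, biases):
    """Pool the encoder outputs, apply a linear layer, return class posteriors."""
    pooled = mean_pool(frames)
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Toy setup: all sizes are hypothetical, chosen only to keep the example small.
random.seed(0)
ENCODER_DIM = 4      # assumed encoder output size
NUM_LANGUAGES = 3    # assumed number of target languages
NUM_FRAMES = 10      # assumed utterance length in frames

# Random stand-ins for the frame-level outputs of a pretrained encoder.
frames = [[random.gauss(0.0, 1.0) for _ in range(ENCODER_DIM)]
          for _ in range(NUM_FRAMES)]

# Randomly initialised classification head (would be learned during finetuning).
weights = [[random.gauss(0.0, 0.1) for _ in range(ENCODER_DIM)]
           for _ in range(NUM_LANGUAGES)]
biases = [0.0] * NUM_LANGUAGES

posteriors = language_posteriors(frames, weights, biases)
predicted_language = max(range(NUM_LANGUAGES), key=lambda i: posteriors[i])
```

In an actual finetuning setup, the head and the encoder weights would be updated jointly with a cross-entropy loss over language labels; the sketch only shows the forward pass.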
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
