Brazilian Portuguese Speech Recognition Using Wav2vec 2.0
Lucas Rafael Stefanel Gris, Edresson Casanova, Frederico Santos de Oliveira, Anderson da Silva Soares, Arnaldo Candido Junior

TL;DR
This paper develops a Brazilian Portuguese (BP) speech recognition system by fine-tuning the multilingual pre-trained Wav2vec 2.0 XLSR-53 model on openly available data. To the authors' knowledge, it reaches the lowest error rate among open end-to-end ASR models for BP.
Contribution
It introduces a public BP ASR system fine-tuned from the multilingual Wav2vec 2.0 XLSR-53 model using only openly available audio data, with the lowest reported error among open end-to-end models.
Findings
Average word error rate (WER) of 12.4% across 7 datasets
Average WER drops to 10.5% when a language model is applied
Lowest error among open end-to-end BP ASR models, to the authors' knowledge
Abstract
Deep learning techniques have been shown to be effective in various tasks, especially in the development of speech recognition systems, that is, systems that aim to transcribe spoken audio into a sequence of written words. Despite the progress in the area, speech recognition can still be considered difficult, especially for languages lacking available data, such as Brazilian Portuguese (BP). This work presents the development of a public Automatic Speech Recognition (ASR) system using only openly available audio data, obtained by fine-tuning the Wav2vec 2.0 XLSR-53 model, which is pre-trained on many languages, on BP data. The final model achieves an average word error rate of 12.4% over 7 different datasets (10.5% when applying a language model). To our knowledge, the obtained error is the lowest among open end-to-end (E2E) ASR models for BP.
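The abstract reports results as word error rate (WER) averaged over several datasets. As a minimal sketch of how that metric is usually computed, the Python below implements the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names `word_error_rate` and `average_wer` are illustrative and not taken from the paper, and the unweighted mean over datasets is an assumption, since the abstract does not say how the average is weighted.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def average_wer(per_dataset_wer: Sequence[float]) -> float:
    """Unweighted mean over datasets (assumed; the weighting is not stated)."""
    if not per_dataset_wer:
        raise ValueError("need at least one dataset score")
    return sum(per_dataset_wer) / len(per_dataset_wer)
```

For example, comparing the reference "o gato preto dorme" with the hypothesis "o gato dorme" gives one deletion over four reference words, so a WER of 0.25. Real evaluations typically normalize case and punctuation before scoring, which this sketch leaves out.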
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
