A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation

Rodrigo Lima; Sidney Evaldo Leal; Arnaldo Candido Junior; Sandra Maria Aluísio

arXiv:2409.15350 · eess.AS · September 25, 2024

TL;DR

This paper introduces a large, freely available corpus of spontaneous São Paulo Portuguese speech and evaluates automatic speech recognition (ASR) models trained on it, with Distil-Whisper reaching a word error rate (WER) of 24.22% and Wav2Vec2-XLSR-53 reaching 33.73%.

Contribution

The paper presents what the authors describe as the first large spontaneous speech corpus of São Paulo Portuguese for ASR, together with baseline experiments and fine-tuned models to support future research.

Findings

01

Distil-Whisper achieved a WER of 24.22% on the corpus.

02

Wav2Vec2-XLSR-53 achieved a WER of 33.73%.

03

The corpus and models are publicly available for reproducibility.
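The findings are reported as word error rates. As a minimal sketch of how WER is conventionally computed (not the authors' evaluation script), the function below counts word-level substitutions (S), deletions (D) and insertions (I) with a Levenshtein alignment and divides by the number of reference words N, i.e. WER = (S + D + I) / N. It assumes plain whitespace tokenization and lowercasing; the text normalization actually used for the NURC-SP experiments (punctuation, numerals, hesitations) is not described on this page, and the helper names `word_error_rate` and `corpus_wer` are hypothetical.

```python
from typing import Iterable, List, Tuple


def _edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance: minimal S + D + I to turn ref into hyp."""
    # prev[j] holds the distance between the previous ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of ref[i-1]
                curr[j - 1] + 1,     # insertion of hyp[j-1]
                prev[j - 1] + cost,  # substitution (or match if cost == 0)
            )
        prev = curr
    return prev[-1]


def _tokenize(text: str) -> List[str]:
    # Assumption: lowercase + whitespace split is the only normalization.
    return text.lower().split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for a single utterance."""
    ref, hyp = _tokenize(reference), _tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return _edit_distance(ref, hyp) / len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total errors over total reference words (not a mean of per-utterance WERs)."""
    errors = words = 0
    for reference, hypothesis in pairs:
        ref, hyp = _tokenize(reference), _tokenize(hypothesis)
        errors += _edit_distance(ref, hyp)
        words += len(ref)
    if words == 0:
        raise ValueError("corpus has no reference words")
    return errors / words


if __name__ == "__main__":
    # One substitution ("morava" -> "mora") over six reference words.
    print(f"{word_error_rate('a gente morava em são paulo', 'a gente mora em são paulo'):.2%}")
```

Running the script prints `16.67%`. Pooling errors across utterances, as `corpus_wer` does, keeps short utterances from being over-weighted, which is the usual convention when a single WER is reported for a whole test set.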

Abstract

We present a freely available spontaneous speech corpus for the Brazilian Portuguese language and report preliminary automatic speech recognition (ASR) results, using both the Wav2Vec2-XLSR-53 and Distil-Whisper models fine-tuned and trained on our corpus. The NURC-SP Audio Corpus comprises 401 different speakers (204 females, 197 males) with a total of 239.30 hours of transcribed audio recordings. To the best of our knowledge, this is the first large Paulistano-accented spontaneous speech corpus dedicated to the ASR task in Portuguese. We first present the design and development procedures of the NURC-SP Audio Corpus, and then describe four ASR experiments in detail. The experiments demonstrated promising results for the applicability of the corpus for ASR. Specifically, we fine-tuned two versions of the Wav2Vec2-XLSR-53 model, trained a Distil-Whisper model using our dataset with labels…

Taxonomy

Topics: Speech Recognition and Synthesis