Tagarela - A Portuguese speech dataset from podcasts
Frederico Santos de Oliveira, Lucas Rafael Stefanel Gris, Alef Iury Siqueira Ferreira, Augusto Seben da Rosa, Alexandre Costa Ferro Filho, Edresson Casanova, Christopher Dane Shulby, Rafael Teixeira Sousa, Diogo Fernandes Costa Silva, Anderson da Silva Soares

TL;DR
The paper introduces TAGARELA, a large-scale Portuguese speech dataset from podcasts, to support the development of ASR and TTS models, addressing resource scarcity in Portuguese speech processing.
Contribution
It provides a publicly available, high-quality, large-scale Portuguese speech dataset with validated transcriptions, enabling improved speech technology development for Portuguese.
Findings
ASR and TTS models trained on TAGARELA perform effectively
Dataset quality validated through model-based transcription accuracy
Public release to foster Portuguese speech technology research
Abstract
Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Notably, its scale rivals English's GigaSpeech (10kh), enabling state-of-the-art Portuguese models. To ensure data quality, the corpus was subjected to an audio pre-processing pipeline and subsequently transcribed using a mixed strategy: we applied ASR models that were previously trained on high-fidelity transcriptions generated by proprietary APIs, ensuring a high level of initial accuracy. Finally, to validate the effectiveness of this new resource, we present ASR and TTS models trained exclusively on our dataset and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Respiratory and Cough-Related Research · Topic Modeling
