Tagarela - A Portuguese speech dataset from podcasts

Frederico Santos de Oliveira; Lucas Rafael Stefanel Gris; Alef Iury Siqueira Ferreira; Augusto Seben da Rosa; Alexandre Costa Ferro Filho; Edresson Casanova; Christopher Dane Shulby; Rafael Teixeira Sousa; Diogo Fernandes Costa Silva; Anderson da Silva Soares; Arlindo Rodrigues Galvão Filho

arXiv:2603.15326·cs.CL·March 17, 2026

TL;DR

The paper introduces TAGARELA, a large-scale Portuguese speech dataset from podcasts, to support the development of ASR and TTS models, addressing resource scarcity in Portuguese speech processing.

Contribution

It provides a publicly available, high-quality, large-scale Portuguese speech dataset with validated transcriptions, enabling improved speech technology development for Portuguese.

Findings

01. ASR and TTS models trained on TAGARELA perform effectively

02. Dataset quality validated through model-based transcription accuracy

03. Public release to foster Portuguese speech technology research
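
The second finding refers to model-based transcription accuracy. A common way to quantify transcription accuracy is the word error rate (WER): the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. The sketch below illustrates that metric only; this page does not state which metric or text normalization the authors used, so the function name `word_error_rate`, the lowercase-and-split normalization, and the example sentences are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: the normalization (lowercasing, whitespace split)
    is an assumption, not the procedure used for TAGARELA.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of five reference words is dropped.
wer = word_error_rate("o gato subiu no telhado", "o gato subiu telhado")
print(f"WER = {wer:.2f}")
```

A lower WER means the automatic transcript is closer to the reference; a value of 0 means the two word sequences match exactly after normalization.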

Abstract

Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Notably, its scale rivals that of the English GigaSpeech corpus (10,000 hours), enabling state-of-the-art Portuguese models. To ensure data quality, the corpus was subjected to an audio pre-processing pipeline and subsequently transcribed using a mixed strategy: we applied ASR models that were previously trained on high-fidelity transcriptions generated by proprietary APIs, ensuring a high level of initial accuracy. Finally, to validate the effectiveness of this new resource, we present ASR and TTS models trained exclusively on our dataset and…

Code & Models

Models

alefiury/parakeet-tdt-0.6b-v3-ptBR-TAGARELA-onnx (Hugging Face model)

Taxonomy

Topics: Speech Recognition and Synthesis · Respiratory and Cough-Related Research · Topic Modeling