Abstract
Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present TAGARELA, a new dataset composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Notably, its scale is comparable to that of the English GigaSpeech corpus (10,000 hours), making it suitable for training state-of-the-art Portuguese models.
To ensure data quality, the corpus was subjected to an audio preprocessing pipeline and subsequently transcribed using a mixed strategy: we applied ASR models that were previously trained on high-fidelity transcriptions generated by proprietary APIs, ensuring a high level of initial accuracy. Finally, to validate the effectiveness of this new resource, we present ASR and TTS models trained exclusively on our dataset and evaluate their performance, demonstrating its potential to drive the development of more robust and natural speech technologies for Portuguese.
Dataset Statistics
- Hours of audio: 8,972+
- Podcast episodes: 16,806
- Podcast shows: 2,094
- Distinct speakers: 13,368
Data Distribution
By Dialect
- Brazilian Portuguese (pt-BR): 8,130 hours (91%)
- European Portuguese (pt-PT): 842 hours (9%)
By Gender
- Male speakers: 6,368 hours (71%)
- Female speakers: 2,604 hours (29%)
Segment Statistics
- Average duration: 9.30 ± 5.49 seconds
- Average words per segment: 27.69 ± 17.06 words
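The segment statistics above are reported as mean ± standard deviation. A minimal sketch of how such figures could be recomputed from a segment manifest follows. The manifest layout (`duration` and `text` fields), the toy values, and the use of the population standard deviation are assumptions for illustration, not details taken from the dataset.

```python
from statistics import mean, pstdev

# Hypothetical manifest: one entry per segment (values are illustrative only).
segments = [
    {"duration": 6.0, "text": "bom dia a todos"},
    {"duration": 9.0, "text": "hoje vamos falar sobre ciência aberta"},
    {"duration": 12.0, "text": "o episódio de hoje tem uma convidada especial"},
]

durations = [seg["duration"] for seg in segments]
word_counts = [len(seg["text"].split()) for seg in segments]

# Population standard deviation is an assumption; the sample version
# (statistics.stdev) is an equally plausible reading of the "±" figures.
print(f"Average duration: {mean(durations):.2f} ± {pstdev(durations):.2f} seconds")
print(f"Average words per segment: {mean(word_counts):.2f} ± {pstdev(word_counts):.2f} words")
```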
Dataset Subsets
Full Subset
ASR, 8,972 hours: Includes audio containing various types of disfluencies, designed for robust automatic speech recognition training.
Clean-Speech Subset
TTS, 2,800 hours: A curated speech-only subset designed for high-quality text-to-speech and speech generation tasks.
Processing Pipeline
The TAGARELA dataset was created through a comprehensive multi-stage pipeline designed to ensure high quality and consistency:
Audio Standardization
All audio is converted to FLAC format with a 16 kHz sample rate, 16-bit depth, and a mono channel.
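One way to reproduce this target format is with ffmpeg. The sketch below only builds the command line; the helper name, the file names, and the choice of ffmpeg itself are assumptions, since the tool actually used for TAGARELA is not stated here.

```python
from pathlib import Path


def ffmpeg_standardize_cmd(src: str, dst_dir: str) -> list[str]:
    """Build an ffmpeg command that writes a 16 kHz, 16-bit, mono FLAC file."""
    dst = Path(dst_dir) / (Path(src).stem + ".flac")
    return [
        "ffmpeg",
        "-y",                  # overwrite the output file if it exists
        "-i", src,             # input in any format ffmpeg can decode
        "-ac", "1",            # downmix to a single (mono) channel
        "-ar", "16000",        # resample to 16 kHz
        "-sample_fmt", "s16",  # 16-bit signed samples
        "-c:a", "flac",        # encode as FLAC
        str(dst),
    ]


# Hypothetical input file and output directory.
cmd = ffmpeg_standardize_cmd("episode_001.mp3", "standardized")
print(" ".join(cmd))
```

To run the conversion, pass the list to `subprocess.run(cmd, check=True)` on a machine with ffmpeg on the `PATH`.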
Segmentation
Long-form recordings segmented into 5-20 second clips at natural silence points to maintain speech cohesiveness.
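The idea of cutting at silence points while respecting the 5-20 second bounds can be illustrated with a simple greedy planner. This is a sketch under assumptions: the silence timestamps are taken as given (for example from a voice activity detector), and the "latest silence in the window" policy is one possible choice, not necessarily the one used to build TAGARELA.

```python
def plan_cuts(silences, total, min_len=5.0, max_len=20.0):
    """Return (start, end) segments in seconds, cut at silence points.

    Greedy policy: from each start, cut at the latest silence point that keeps
    the segment within [min_len, max_len]; fall back to a hard cut at max_len
    when no silence point falls inside that window.
    """
    segments = []
    start = 0.0
    while total - start > max_len:
        window = [s for s in silences if start + min_len <= s <= start + max_len]
        end = max(window) if window else start + max_len
        segments.append((start, end))
        start = end
    if total - start > 0:
        segments.append((start, total))  # the tail may be shorter than min_len
    return segments


# Hypothetical silence midpoints (seconds) in a 45-second recording.
silences = [4.0, 12.5, 18.0, 27.0, 33.5, 41.0]
segments = plan_cuts(silences, total=45.0)
print(segments)
```

A production pipeline would also need a rule for a tail shorter than the minimum length, such as merging it into the previous segment or discarding it.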
Speaker Diarization
Applied pyannote framework to identify and label speech segments for each speaker individually.
Overlapping Speech Detection
Trained Wav2vec2-XLS-R classifier to identify and discard segments with overlapping speech.
Transcription Generation
A bootstrap strategy: ElevenLabs Scribe transcribed a seed corpus, which was then used to fine-tune Whisper large-v3 for pseudo-labeling. Pseudo-labels were quality-filtered by their agreement with Wav2vec2-XLS-R.
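Agreement-based filtering can be sketched as follows: a segment is kept only when two independent transcriptions of it are close enough. The word-level distance measure, the 0.2 threshold, and the example sentences are assumptions for illustration; the exact criterion used for TAGARELA is not given here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution (or match at no cost)
            ))
        prev = curr
    return prev[-1]


def disagreement(hyp_a: str, hyp_b: str) -> float:
    """Word-level edit distance, normalized by the longer hypothesis."""
    a, b = hyp_a.lower().split(), hyp_b.lower().split()
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))


MAX_DISAGREEMENT = 0.2  # hypothetical threshold, not taken from the dataset

# (Whisper hypothesis, Wav2vec2 hypothesis) pairs; illustrative text only.
pairs = [
    ("o podcast começa agora", "o podcast começa agora"),
    ("hoje vamos falar de ciência", "hoje vamos calar de ciências"),
]
kept = [whisper for whisper, w2v in pairs
        if disagreement(whisper, w2v) <= MAX_DISAGREEMENT]
print(kept)
```

Normalizing by the longer hypothesis keeps the score in the range 0 to 1, which makes a single threshold easier to interpret across short and long segments.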
Quality Enhancement
Vocos vocoder repurposed as denoiser to remove background noise, hiss, and light reverberation.
Benchmark Results
Automatic Speech Recognition (ASR)
Models trained on TAGARELA and evaluated on Common Voice 17.0 (pt) test set:
| Model | WER (%) |
|---|---|
| Canary-1B-Flash | 7.8 |
| Distil-Whisper | 9.2 |
| Parakeet TDT | 12.3 |
Text-to-Speech (TTS)
Models trained on the 2,800-hour clean-speech subset:
| Model | CER (%) | WER (%) | MOS |
|---|---|---|---|
| Orpheus-TTS | 19.32 ± 31.64 | 26.81 ± 35.57 | 4.00 ± 0.94 |
| Chatterbox | 23.73 ± 26.17 | 31.50 ± 30.05 | 4.53 ± 0.25 |
License
The TAGARELA dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. This means you are free to:
- Share — copy and redistribute the material in any medium or format.
- Adapt — remix, transform, and build upon the material.
Under the following terms:
- Attribution (BY) — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- NonCommercial (NC) — You may not use the material for commercial purposes.
- ShareAlike (SA) — If you remix, transform, or build upon the material, you must distribute your contributions under the same license.
Citation
If you use the TAGARELA dataset in your research, please cite:
Authors
Acknowledgements
This work has been fully funded by the project Research and Development of Algorithms for Construction of Digital Human Technological Components supported by the Advanced Knowledge Center in Immersive Technologies (AKCIT) in partnership with the Federal University of Goiás (UFG).