

Abstract

Despite significant advances in speech processing, Portuguese remains under-resourced because public, large-scale, high-quality datasets are scarce. To address this gap, we present TAGARELA, a new dataset of over 8,972 hours of podcast audio, curated specifically for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Its scale is comparable to that of the English GigaSpeech corpus (10,000 hours), which makes it suitable for training state-of-the-art Portuguese models.

To ensure data quality, the corpus went through an audio preprocessing pipeline and was then transcribed using a mixed strategy: ASR models were first trained on high-fidelity transcriptions generated by proprietary APIs and then used to transcribe the corpus, which gives a high level of initial accuracy. Finally, to validate the effectiveness of this new resource, we present ASR and TTS models trained exclusively on our dataset and evaluate their performance, demonstrating its potential to drive the development of more robust and natural speech technologies for Portuguese.


Dataset Statistics

8,972+ hours of audio
16,806 podcast episodes
2,094 podcast shows
13,368 distinct speakers


Data Distribution

[Charts: distribution by dialect, distribution by gender, and segment statistics.]


Dataset Subsets

Full Subset (ASR)

8,972 hours. Includes audio containing various types of disfluencies; intended for training robust automatic speech recognition models.

Clean-Speech Subset (TTS)

2,800 hours. A curated speech-only subset intended for high-quality text-to-speech and speech generation tasks.
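A few derived figures help put these totals in perspective. The short Python sketch below computes them from the numbers listed on this page. The averages are back-of-the-envelope values, not figures reported by the authors, and 8,972 is treated as an exact total even though the page gives it as a lower bound.

```python
# Totals as listed on this page (8,972 is a lower bound for the hours).
TOTAL_HOURS = 8972
CLEAN_HOURS = 2800
EPISODES = 16806
SHOWS = 2094


def derived_stats():
    """Return rough averages derived from the reported totals."""
    return {
        "clean_share_pct": 100.0 * CLEAN_HOURS / TOTAL_HOURS,
        "minutes_per_episode": 60.0 * TOTAL_HOURS / EPISODES,
        "episodes_per_show": EPISODES / SHOWS,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions, the clean-speech subset is a little under a third of the full corpus, and the average episode runs for roughly half an hour.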


Processing Pipeline

The TAGARELA dataset was created through a comprehensive multi-stage pipeline designed to ensure high quality and consistency:

1. Audio Standardization

All audio was converted to FLAC at a 16 kHz sample rate, 16-bit depth, and a single (mono) channel.

2. Segmentation

Long-form recordings were segmented into clips of 5 to 20 seconds, cut at natural silence points to keep speech cohesive.

3. Speaker Diarization

The pyannote framework was applied to identify and label the speech segments of each speaker individually.

4. Overlapping Speech Detection

A Wav2vec2-XLS-R classifier was trained to identify and discard segments with overlapping speech.

5. Transcription Generation

A bootstrap strategy was used: ElevenLabs Scribe transcribed a seed corpus, Whisper large-v3 was then fine-tuned on it for pseudo-labeling, and the pseudo-labels were quality-filtered by agreement with Wav2vec2-XLS-R.

6. Quality Enhancement

The Vocos vocoder was repurposed as a denoiser to remove background noise, hiss, and light reverberation.
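Steps 1 and 2 above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (`ffmpeg_standardize_cmd`, `segment_at_silence`), the 20 ms frame size, and the rule of cutting at the quietest frame between 5 and 20 seconds are all assumptions made for illustration. The ffmpeg flags are standard options, but whether the authors used ffmpeg is not stated.

```python
import math


def ffmpeg_standardize_cmd(src, dst):
    """Build an ffmpeg command for FLAC output at 16 kHz, 16-bit, mono.

    Hypothetical helper: the source does not say which tool was used.
    """
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1",
            "-sample_fmt", "s16", "-c:a", "flac", dst]


def frame_rms(samples, frame_len):
    """RMS energy of consecutive, non-overlapping frames."""
    n_frames = len(samples) // frame_len
    return [
        math.sqrt(sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len])
                  / frame_len)
        for i in range(n_frames)
    ]


def segment_at_silence(samples, sample_rate=16000, min_s=5.0, max_s=20.0,
                       frame_s=0.02):
    """Split a recording into (start, end) sample ranges of min_s..max_s seconds.

    Each cut is placed at the quietest frame that keeps the clip within the
    allowed duration (an assumed stand-in for "natural silence points").
    The final clip keeps whatever audio remains, so it may be shorter than min_s.
    """
    frame_len = int(round(sample_rate * frame_s))
    rms = frame_rms(samples, frame_len)
    min_f = int(round(min_s / frame_s))
    max_f = int(round(max_s / frame_s))

    segments, start = [], 0
    while len(rms) - start > max_f:
        window = rms[start + min_f:start + max_f]
        cut = start + min_f + window.index(min(window))
        segments.append((start * frame_len, cut * frame_len))
        start = cut
    segments.append((start * frame_len, len(samples)))
    return segments
```

In practice the command list would be passed to `subprocess.run`, and the sample ranges returned by `segment_at_silence` would be used to slice the standardized audio before diarization. A production pipeline would more likely rely on a voice activity detector than on raw frame energy.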


Benchmark Results

Automatic Speech Recognition (ASR)

Models trained on TAGARELA and evaluated on Common Voice 17.0 (pt) test set:

Model              WER (%)
Canary-1B-Flash    7.8
Distil-Whisper     9.2
Parakeet TDT       12.3
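For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between reference and hypothesis transcripts, divided by the number of reference words. The Python sketch below shows one common way to compute a corpus-level WER. It is an illustration only: the evaluation tooling and the text normalization behind the figures above are not described here, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER in percent: total word errors / total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return 100.0 * errors / words


def cer(references, hypotheses):
    """Corpus-level character error rate in percent (spaces count as characters)."""
    errors = sum(edit_distance(list(r), list(h))
                 for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return 100.0 * errors / chars
```

Because errors are summed over the whole test set before dividing, long utterances weigh more than short ones. Averaging per-utterance WER instead would generally give a different number.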

Text-to-Speech (TTS)

Models trained on the 2,800-hour clean-speech subset:

Model          CER (%)          WER (%)          MOS
Orpheus-TTS    19.32 ± 31.64    26.81 ± 35.57    4.00 ± 0.94
Chatterbox     23.73 ± 26.17    31.50 ± 30.05    4.53 ± 0.25

License

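The TAGARELA dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. This means you are free to:

Share: copy and redistribute the material in any medium or format.
Adapt: remix, transform, and build upon the material.

Under the following terms:

Attribution: you must give appropriate credit, provide a link to the license, and indicate if changes were made.
NonCommercial: you may not use the material for commercial purposes.
ShareAlike: if you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.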


Citation

If you use the TAGARELA dataset in your research, please cite:

@inproceedings{oliveira2026tagarela,
  title={TAGARELA - A Portuguese Speech Dataset from Podcasts},
  author={Oliveira, Frederico Santos de and Gris, Lucas Rafael Stefanel and Ferreira, Alef Iury Siqueira and Rosa, Augusto Seben da and Ferro Filho, Alexandre Costa and Casanova, Edresson and Shulby, Christopher Dane and Sousa, Rafael Teixeira and Silva, Diogo Fernandes Costa and Soares, Anderson da Silva and Galv{\~a}o Filho, Arlindo Rodrigues},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026}
}

Authors

Frederico Santos de Oliveira¹, Lucas Rafael Stefanel Gris², Alef Iury Siqueira Ferreira², Augusto Seben da Rosa³, Alexandre Costa Ferro Filho², Edresson Casanova⁴, Christopher Dane Shulby⁵, Rafael Teixeira Sousa¹, Diogo Fernandes Costa Silva², Anderson da Silva Soares², Arlindo Rodrigues Galvão Filho²

¹ Federal University of Mato Grosso (UFMT)
² Federal University of Goiás (UFG)
³ São Paulo State University (UNESP)
⁴ NVIDIA
⁵ Elsa Speak


Acknowledgements

This work was fully funded by the project "Research and Development of Algorithms for Construction of Digital Human Technological Components", supported by the Advanced Knowledge Center in Immersive Technologies (AKCIT) in partnership with the Federal University of Goiás (UFG).