CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages
Frederico S. Oliveira, Edresson Casanova, Arnaldo Cândido Júnior, Anderson S. Soares, and Arlindo R. Galvão Filho
Abstract:
In this paper, we present CML-TTS, a recursive acronym for CML-Multi-Lingual-TTS, a new Text-to-Speech (TTS) dataset developed at the Center of Excellence in Artificial Intelligence (CEIA) of the Federal University of Goias (UFG). CML-TTS is based on Multilingual LibriSpeech (MLS) and adapted for training TTS models, consisting of audiobooks in seven languages: Dutch, French, German, Italian, Portuguese, Polish, and Spanish. Additionally, we provide the YourTTS model, a multi-lingual TTS model, trained using 3,176.13 hours from CML-TTS and also with 245.07 hours from LibriTTS, in English. Our purpose in creating this dataset is to open up new research possibilities in the TTS area for multi-lingual models. The dataset is available for download at https://github.com/freds0/CML-TTS-Dataset under the CC-BY 4.0 license.
Download
You can download the dataset from http://www.openslr.org/146/, or download each part separately:
- CML-TTS Dataset Dutch - MD5: 56e11612ffea33282eced3d499cbb1ca
- CML-TTS Dataset French - MD5: 410f8e144fa1a5c8e771b08b2e555a9b
- CML-TTS Dataset German - MD5: 263782ee31981b101c29d09b058361e2
- CML-TTS Dataset Italian - MD5: bb6160b8ee968ac8caa5a32ec4bd91ba
- CML-TTS Dataset Polish - MD5: 88e6ead2d4df5f29e080d7cd37dcdbdd
- CML-TTS Dataset Portuguese - MD5: 8c877a4be0eb41f275497609df5a114c
- CML-TTS Dataset Spanish - MD5: afdd6c348d8d1ee693ea7fcf2ea57b9c
- CML-TTS Dataset Segments - MD5: f529a908aba26a6d891b4fb17ab3125b
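The MD5 checksums listed above can be used to confirm that an archive downloaded intact. The following is a minimal sketch, not part of the official tooling: the function names `md5_of_file` and `verify_archive` and the dictionary keys are illustrative assumptions, and the archive path you pass in depends on how you saved the file.

```python
import hashlib

# Checksums copied from the download list above.
# The dictionary keys are illustrative labels, not official file names.
EXPECTED_MD5 = {
    "dutch": "56e11612ffea33282eced3d499cbb1ca",
    "french": "410f8e144fa1a5c8e771b08b2e555a9b",
    "german": "263782ee31981b101c29d09b058361e2",
    "italian": "bb6160b8ee968ac8caa5a32ec4bd91ba",
    "polish": "88e6ead2d4df5f29e080d7cd37dcdbdd",
    "portuguese": "8c877a4be0eb41f275497609df5a114c",
    "spanish": "afdd6c348d8d1ee693ea7fcf2ea57b9c",
    "segments": "f529a908aba26a6d891b4fb17ab3125b",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte archives never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, part):
    """Return True if the file at `path` matches the published checksum for `part`."""
    return md5_of_file(path) == EXPECTED_MD5[part]
```

For example, `verify_archive(path_to_dutch_archive, "dutch")` returns `True` only when the file's digest matches the value published for the Dutch archive; a `False` result usually means the download was truncated or corrupted and should be repeated.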
Statistics
CML-TTS is a dataset of audiobooks drawn from public-domain Project Gutenberg books and read by volunteers from the LibriVox project. It includes recordings in Dutch, French, German, Italian, Polish, Portuguese, and Spanish, all at a sampling rate of 24 kHz. The following figure shows pie charts of each language's share of the total duration (left), the distribution of sample quality (center), and the distribution of speaker gender (right).


Dutch Samples
Audio Samples for CML-TTS Dutch
Speaker | Ground Truth | YourTTS |
---|---|---|
880 | ||
2825 | ||
3798 | ||
11290 | ||
French Samples
Audio Samples for CML-TTS French
Speaker | Ground Truth | YourTTS |
---|---|---|
1591 | ||
4193 | ||
4482 | ||
7400 | ||
German Samples
Audio Samples for CML-TTS German
Speaker | Ground Truth | YourTTS |
---|---|---|
135 | ||
3363 | ||
7120 | ||
9494 | ||
Italian Samples
Audio Samples for CML-TTS Italian
Speaker | Ground Truth | YourTTS |
---|---|---|
4712 | ||
6698 | ||
7458 | ||
8582 | ||
Polish Samples
Audio Samples for CML-TTS Polish
Speaker | Ground Truth | YourTTS |
---|---|---|
1889 | ||
7775 | ||
8758 | ||
11295 | ||
Portuguese Samples
Audio Samples for CML-TTS Portuguese
Speaker | Ground Truth | YourTTS |
---|---|---|
4067 | ||
6549 | ||
11247 | ||
12710 | ||
Spanish Samples
Audio Samples for CML-TTS Spanish
Speaker | Ground Truth | YourTTS |
---|---|---|
1075 | ||
3503 | ||
5691 | ||
7510 | ||
Citation
@InProceedings{Cmltts2023,
  title="CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages",
  author="Oliveira, Frederico S. and Casanova, Edresson and Junior, Arnaldo Candido and Soares, Anderson S. and Galv{\~a}o Filho, Arlindo R.",
  editor="Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav",
  booktitle="Text, Speech, and Dialogue",
  year="2023",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="188--199",
  isbn="978-3-031-40498-6"
}