CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages

GitHub • Paper on arXiv • Colab Demo

Frederico S. Oliveira, Edresson Casanova, Arnaldo Cândido Júnior, Anderson S. Soares, and Arlindo R. Galvão Filho

Abstract:

In this paper, we present CML-TTS, a recursive acronym for CML-Multi-Lingual-TTS, a new Text-to-Speech (TTS) dataset developed at the Center of Excellence in Artificial Intelligence (CEIA) of the Federal University of Goias (UFG). CML-TTS is based on Multilingual LibriSpeech (MLS) and adapted for training TTS models, consisting of audiobooks in seven languages: Dutch, French, German, Italian, Portuguese, Polish, and Spanish. Additionally, we provide the YourTTS model, a multi-lingual TTS model, trained using 3,176.13 hours from CML-TTS and also with 245.07 hours from LibriTTS, in English. Our purpose in creating this dataset is to open up new research possibilities in the TTS area for multi-lingual models. The dataset is available for download at https://github.com/freds0/CML-TTS-Dataset under the CC-BY 4.0 license.

Download

You can download the dataset from http://www.openslr.org/146/, or download each version separately:

CML-TTS Dataset Dutch - MD5: 56e11612ffea33282eced3d499cbb1ca
CML-TTS Dataset French - MD5: 410f8e144fa1a5c8e771b08b2e555a9b
CML-TTS Dataset German - MD5: 263782ee31981b101c29d09b058361e2
CML-TTS Dataset Italian - MD5: bb6160b8ee968ac8caa5a32ec4bd91ba
CML-TTS Dataset Polish - MD5: 88e6ead2d4df5f29e080d7cd37dcdbdd
CML-TTS Dataset Portuguese - MD5: 8c877a4be0eb41f275497609df5a114c
CML-TTS Dataset Spanish - MD5: afdd6c348d8d1ee693ea7fcf2ea57b9c
CML-TTS Dataset Segments - MD5: f529a908aba26a6d891b4fb17ab3125b
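After downloading an archive, it is worth checking it against the MD5 value listed above before extracting it. The following is a minimal sketch of such a check, not an official tool of the dataset: the function names, the dictionary keys, and the archive path in the usage note are hypothetical, and only the checksum values come from the list above.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the download list above.
# The dictionary keys are illustrative labels, not official file names.
EXPECTED_MD5 = {
    "dutch": "56e11612ffea33282eced3d499cbb1ca",
    "french": "410f8e144fa1a5c8e771b08b2e555a9b",
    "german": "263782ee31981b101c29d09b058361e2",
    "italian": "bb6160b8ee968ac8caa5a32ec4bd91ba",
    "polish": "88e6ead2d4df5f29e080d7cd37dcdbdd",
    "portuguese": "8c877a4be0eb41f275497609df5a114c",
    "spanish": "afdd6c348d8d1ee693ea7fcf2ea57b9c",
    "segments": "f529a908aba26a6d891b4fb17ab3125b",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large archives do not have to fit in memory."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, subset):
    """Return True if the file at `path` matches the listed MD5 for `subset`."""
    return md5_of_file(path) == EXPECTED_MD5[subset]
```

For example, `verify_archive("path/to/dutch_archive", "dutch")` would return `True` only if the file at that (placeholder) path matches the Dutch checksum. A mismatch usually indicates an incomplete or corrupted download.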

Statistics

CML-TTS is a dataset comprising audiobooks sourced from the public domain books of Project Gutenberg, read by volunteers from the LibriVox project. The dataset includes recordings in Dutch, German, French, Italian, Polish, Portuguese, and Spanish, all at a sampling rate of 24 kHz. The following figure shows pie charts indicating each language's share of the total duration (on the left), the distribution of sample quality (in the center), and the distribution of speaker gender (on the right).

The table below displays the total duration of each language subset present in the CML-TTS dataset, as well as the duration of the Train, Test, and Dev sets. Additionally, the table provides the duration of the sets categorized by speaker gender.

Dutch Samples

Audio Samples for CML-TTS Dutch

Speaker	Ground Truth	YourTTS
880	[audio]	[audio]
2825	[audio]	[audio]
3798	[audio]	[audio]
11290	[audio]	[audio]

French Samples

Audio Samples for CML-TTS French

Speaker	Ground Truth	YourTTS
1591	[audio]	[audio]
4193	[audio]	[audio]
4482	[audio]	[audio]
7400	[audio]	[audio]

German Samples

Audio Samples for CML-TTS German

Speaker	Ground Truth	YourTTS
135	[audio]	[audio]
3363	[audio]	[audio]
7120	[audio]	[audio]
9494	[audio]	[audio]

Italian Samples

Audio Samples for CML-TTS Italian

Speaker	Ground Truth	YourTTS
4712	[audio]	[audio]
6698	[audio]	[audio]
7458	[audio]	[audio]
8582	[audio]	[audio]

Polish Samples

Audio Samples for CML-TTS Polish

Speaker	Ground Truth	YourTTS
1889	[audio]	[audio]
7775	[audio]	[audio]
8758	[audio]	[audio]
11295	[audio]	[audio]

Portuguese Samples

Audio Samples for CML-TTS Portuguese

Speaker	Ground Truth	YourTTS
4067	[audio]	[audio]
6549	[audio]	[audio]
11247	[audio]	[audio]
12710	[audio]	[audio]

Spanish Samples

Audio Samples for CML-TTS Spanish

Speaker	Ground Truth	YourTTS
1075	[audio]	[audio]
3503	[audio]	[audio]
5691	[audio]	[audio]
7510	[audio]	[audio]

Citation

@InProceedings{Cmltts2023,
    title="CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages",
    author="Oliveira, Frederico S. and Casanova, Edresson and Junior, Arnaldo Candido and Soares, Anderson S. and Galv{\~a}o Filho, Arlindo R.", 
    editor="Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav",
    booktitle="Text, Speech, and Dialogue",
    year="2023",
    publisher="Springer Nature Switzerland",
    address="Cham",
    pages="188--199",
    isbn="978-3-031-40498-6"
}
