

CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages

Frederico S. Oliveira, Edresson Casanova, Arnaldo Cândido Júnior, Anderson S. Soares, and Arlindo R. Galvão Filho

Abstract:

In this paper, we present CML-TTS, a recursive acronym for CML-Multi-Lingual-TTS, a new Text-to-Speech (TTS) dataset developed at the Center of Excellence in Artificial Intelligence (CEIA) of the Federal University of Goiás (UFG). CML-TTS is based on Multilingual LibriSpeech (MLS) and adapted for training TTS models. It consists of audiobooks in seven languages: Dutch, French, German, Italian, Portuguese, Polish, and Spanish. Additionally, we provide the YourTTS model, a multi-lingual TTS model trained on 3,176.13 hours from CML-TTS plus 245.07 hours of English speech from LibriTTS. Our purpose in creating this dataset is to open up new research possibilities in the TTS area for multi-lingual models. The dataset is available for download at https://github.com/freds0/CML-TTS-Dataset under the CC-BY 4.0 license.

Download

You can download the dataset from http://www.openslr.org/146/, or download each version separately:

Statistics

CML-TTS is a dataset comprising audiobooks sourced from the public-domain books of Project Gutenberg, read by volunteers from the LibriVox project. The dataset includes recordings in Dutch, German, French, Italian, Polish, Portuguese, and Spanish, all at a sampling rate of 24 kHz. The following figure shows pie charts indicating each language's share of the total duration (on the left), the distribution of sample quality (in the center), and the distribution of speaker gender (on the right).
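As a quick sanity check after downloading, the stated sampling rate can be verified per file. The following is a minimal sketch, assuming the audio is stored as PCM WAV files readable by Python's standard `wave` module; the helper names `wav_info` and `has_expected_rate` are hypothetical and not part of the dataset tooling.

```python
import wave

# Sampling rate stated for CML-TTS recordings (24 kHz).
EXPECTED_RATE_HZ = 24_000


def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
    return rate, frames / rate


def has_expected_rate(path, expected=EXPECTED_RATE_HZ):
    """True if the file's sampling rate matches the expected rate."""
    rate, _ = wav_info(path)
    return rate == expected
```

Calling `has_expected_rate` on each downloaded file would flag any recording that was resampled or corrupted in transit. If the files are in another container format, a different reader would be needed.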

The table below displays the total duration of each language subset present in the CML-TTS dataset, as well as the duration of the Train, Test, and Dev sets. Additionally, the table provides the duration of the sets categorized by speaker gender.
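Duration totals of this kind can be recomputed from per-utterance metadata. The sketch below is illustrative only: the record fields (`split`, `gender`, `duration_s`) and the sample values are hypothetical and do not reflect the dataset's actual metadata schema or statistics.

```python
from collections import defaultdict


def hours_by(records, key):
    """Sum utterance durations (in seconds) per value of `key`, in hours."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec["duration_s"]
    return {group: seconds / 3600.0 for group, seconds in totals.items()}


# Hypothetical per-utterance records; real values come from the metadata files.
records = [
    {"split": "train", "gender": "female", "duration_s": 5400.0},
    {"split": "train", "gender": "male", "duration_s": 1800.0},
    {"split": "dev", "gender": "male", "duration_s": 900.0},
    {"split": "test", "gender": "female", "duration_s": 900.0},
]

hours_per_split = hours_by(records, "split")
hours_per_gender = hours_by(records, "gender")
```

Grouping by a single key keeps the helper reusable for the split, gender, or language column, whichever the metadata provides.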

Dutch Samples

Audio Samples for CML-TTS Dutch

Speakers with Ground Truth and YourTTS samples: 880, 2825, 3798, 11290

French Samples

Audio Samples for CML-TTS French

Speakers with Ground Truth and YourTTS samples: 1591, 4193, 4482, 7400

German Samples

Audio Samples for CML-TTS German

Speakers with Ground Truth and YourTTS samples: 135, 3363, 7120, 9494

Italian Samples

Audio Samples for CML-TTS Italian

Speakers with Ground Truth and YourTTS samples: 4712, 6698, 7458, 8582

Polish Samples

Audio Samples for CML-TTS Polish

Speakers with Ground Truth and YourTTS samples: 1889, 7775, 8758, 11295

Portuguese Samples

Audio Samples for CML-TTS Portuguese

Speakers with Ground Truth and YourTTS samples: 4067, 6549, 11247, 12710

Spanish Samples

Audio Samples for CML-TTS Spanish

Speakers with Ground Truth and YourTTS samples: 1075, 3503, 5691, 7510

Citation

@InProceedings{Cmltts2023,
    title="CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages",
    author="Oliveira, Frederico S. and Casanova, Edresson and Junior, Arnaldo Candido and Soares, Anderson S. and Galv{\~a}o Filho, Arlindo R.", 
    editor="Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav",
    booktitle="Text, Speech, and Dialogue",
    year="2023",
    publisher="Springer Nature Switzerland",
    address="Cham",
    pages="188--199",
    isbn="978-3-031-40498-6"
}