Authors: Zhu Li, Yuqing Zhang, Mengxi Nie, Ming Yan, Mengnan He, Ruixiong Zhang, Caixia Gong
Abstract: Recent advancements in end-to-end speech synthesis have made it possible to generate highly natural speech.
However, training these models typically requires a large amount of high-fidelity speech data,
and for unseen texts, the prosody of synthesized speech is relatively unnatural.
To address these issues, we propose to combine a fine-tuned BERT-based front-end
with a pre-trained FastSpeech2-based acoustic model to improve prosody modeling.
The pre-trained BERT is fine-tuned on the polyphone disambiguation task,
the joint Chinese word segmentation (CWS) and part-of-speech (POS) tagging task,
and the prosody structure prediction (PSP) task in a multi-task learning framework.
FastSpeech 2 is pre-trained on large-scale external data that are noisy but easier to obtain.
Experimental results show that both the fine-tuned BERT model and the pre-trained FastSpeech 2 improve prosody,
especially for structurally complex sentences.
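The multi-task fine-tuning described above can be pictured as one shared BERT encoder with three task-specific prediction layers attached in parallel, trained with a joint loss. The sketch below is a minimal PyTorch/HuggingFace illustration, not the authors' implementation; the checkpoint name, label-set sizes, and equal loss weighting are assumptions made for illustration only.

```python
# Minimal sketch (not the authors' code): a shared BERT encoder with three
# parallel token-level prediction heads, fine-tuned jointly.
# Label-set sizes and the unweighted loss sum are illustrative assumptions.
import torch.nn as nn
from transformers import BertModel

class MultiTaskBertFrontEnd(nn.Module):
    def __init__(self,
                 pretrained="bert-base-chinese",  # assumed Chinese BERT checkpoint
                 n_polyphone=300,                 # assumed number of pinyin classes
                 n_cws_pos=60,                    # assumed joint CWS+POS tag set size
                 n_prosody=4):                    # e.g. none / PW / PPH / IPH boundary
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        hidden = self.bert.config.hidden_size
        # Three prediction layers attached in parallel on top of the encoder.
        self.polyphone_head = nn.Linear(hidden, n_polyphone)
        self.cws_pos_head = nn.Linear(hidden, n_cws_pos)
        self.prosody_head = nn.Linear(hidden, n_prosody)
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, input_ids, attention_mask,
                polyphone_labels=None, cws_pos_labels=None, prosody_labels=None):
        # Shared contextual embeddings from BERT: (batch, seq_len, hidden).
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        logits = {
            "polyphone": self.polyphone_head(h),
            "cws_pos": self.cws_pos_head(h),
            "prosody": self.prosody_head(h),
        }
        loss = None
        if polyphone_labels is not None:
            # Sum the three task losses; the real weighting scheme may differ.
            loss = sum(
                self.loss_fn(l.transpose(1, 2), y)   # (B, C, T) vs. (B, T)
                for l, y in [(logits["polyphone"], polyphone_labels),
                             (logits["cws_pos"], cws_pos_labels),
                             (logits["prosody"], prosody_labels)]
            )
        return logits, loss
```

After fine-tuning, the prosody and polyphone predictions from these heads would feed the text front-end that prepares inputs for the acoustic model.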
Models:
1. Base: a front-end with the original FastSpeech 2 (Mel + MB-MelGAN).
2. w/ BERT: a BERT-based front-end with the original FastSpeech 2 (Mel + MB-MelGAN).
3. w/ BERT-CWS+POS: a fine-tuned BERT-based front-end with the original FastSpeech 2 (Mel + MB-MelGAN). BERT-CWS+POS means a BERT encoder with a prediction layer for the joint CWS and POS tagging task.
4. w/ BERT-Prosody: a fine-tuned BERT-based front-end with the original FastSpeech 2 (Mel + MB-MelGAN). BERT-Prosody means a BERT encoder with a prediction layer for the prosody structure prediction task.
5. w/ BERT-Multi: a fine-tuned BERT-based front-end with the original FastSpeech 2 (Mel + MB-MelGAN). BERT-Multi means a BERT encoder with three prediction layers (polyphone disambiguation, joint CWS and POS tagging, and prosody structure prediction) attached in parallel, with the BERT embeddings fine-tuned in a multi-task learning framework, as sketched above.
6. w/ clean: a front-end with a FastSpeech2-based acoustic model pre-trained on clean data (Mel + MB-MelGAN).
7. w/ noisy: a front-end with a FastSpeech2-based acoustic model pre-trained on noisy data (Mel + MB-MelGAN).
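Models 6 and 7 differ only in the data used to pre-train the acoustic model. Assuming the usual transfer-learning recipe (pre-train on a large external corpus, then continue training on the target recordings), the schedule might look like the rough sketch below. The placeholder module, data loaders, learning rates, and step counts are stand-ins; the actual system uses FastSpeech 2 with Mel-spectrogram targets and an MB-MelGAN vocoder.

```python
# Rough sketch of a pre-train / fine-tune schedule for the acoustic model.
# `ToyAcousticModel` and the loaders are stand-ins, not the paper's implementation.
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Placeholder for a FastSpeech2-like phoneme-to-mel model."""
    def __init__(self, n_phones=100, n_mels=80, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_phones, hidden)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, phone_ids):
        return self.proj(self.embed(phone_ids))

def run_stage(model, loader, lr, steps):
    """One training stage over (phone_ids, mel_target) batches."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    it = iter(loader)
    for _ in range(steps):
        try:
            phone_ids, mel_target = next(it)
        except StopIteration:
            it = iter(loader)
            phone_ids, mel_target = next(it)
        loss = nn.functional.l1_loss(model(phone_ids), mel_target)
        opt.zero_grad()
        loss.backward()
        opt.step()

model = ToyAcousticModel()
# Stage 1: pre-train on the large external corpus
# (clean external data for model 6, noisy external data for model 7).
# run_stage(model, external_loader, lr=1e-4, steps=200_000)   # hypothetical loader/steps
# Stage 2: continue training on the target speaker's recordings,
# typically with a lower learning rate.
# run_stage(model, target_loader, lr=1e-5, steps=50_000)      # hypothetical loader/steps
```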
All of the phrases below are unseen during training.