Audio samples from "Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data"

Paper: arXiv
Authors: Zhu Li, Yuqing Zhang, Mengxi Nie, Ming Yan, Mengnan He, Ruixiong Zhang, Caixia Gong
Abstract: Recent advancements in end-to-end speech synthesis have made it possible to generate highly natural speech. However, training these models typically requires a large amount of high-fidelity speech data, and for unseen texts, the prosody of synthesized speech is relatively unnatural. To address these issues, we propose to combine a fine-tuned BERT-based front-end with a pre-trained FastSpeech2-based acoustic model to improve prosody modeling. The pre-trained BERT is fine-tuned on the polyphone disambiguation task, the joint Chinese word segmentation (CWS) and part-of-speech (POS) tagging task, and the prosody structure prediction (PSP) task in a multi-task learning framework. FastSpeech 2 is pre-trained on large-scale external data that are noisy but easier to obtain. Experimental results show that both the fine-tuned BERT model and the pre-trained FastSpeech 2 can improve prosody, especially for those structurally complex sentences.
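
For orientation, here is a minimal, self-contained sketch of the synthesis pipeline the abstract describes: a front-end turns text into phoneme and prosody inputs, a FastSpeech 2-style acoustic model predicts a mel-spectrogram, and an MB-MelGAN-style vocoder turns the mel-spectrogram into a waveform. Every module body, shape, and vocabulary size below is a placeholder chosen for illustration, not the authors' released code.

# Placeholder pipeline sketch: text -> front-end -> acoustic model -> vocoder.
import torch
import torch.nn as nn


class FrontEnd(nn.Module):
    """Stand-in front-end: maps raw text to phoneme IDs and prosody-boundary tags."""

    def forward(self, text: str):
        # A real front-end would run G2P (with polyphone disambiguation),
        # CWS/POS tagging, and prosody structure prediction here.
        n = len(text)
        phoneme_ids = torch.randint(0, 100, (1, n))   # assumed phoneme vocabulary of 100
        prosody_tags = torch.randint(0, 4, (1, n))    # assumed 4 boundary classes
        return phoneme_ids, prosody_tags


class AcousticModel(nn.Module):
    """Stand-in for a FastSpeech 2-style model: token inputs -> mel-spectrogram."""

    def __init__(self, vocab=100, n_prosody=4, d_model=64, n_mels=80):
        super().__init__()
        self.phone_emb = nn.Embedding(vocab, d_model)
        self.prosody_emb = nn.Embedding(n_prosody, d_model)
        self.proj = nn.Linear(d_model, n_mels)

    def forward(self, phoneme_ids, prosody_tags):
        x = self.phone_emb(phoneme_ids) + self.prosody_emb(prosody_tags)
        return self.proj(x)            # (1, T, n_mels) mel-spectrogram


class Vocoder(nn.Module):
    """Stand-in for an MB-MelGAN-style vocoder: mel-spectrogram -> waveform."""

    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, hop)

    def forward(self, mel):
        return self.proj(mel).reshape(1, -1)   # (1, T * hop) waveform


def synthesize(text, front_end, acoustic_model, vocoder):
    phoneme_ids, prosody_tags = front_end(text)
    mel = acoustic_model(phoneme_ids, prosody_tags)
    return vocoder(mel)


wav = synthesize("但是发型还要改进一点。", FrontEnd(), AcousticModel(), Vocoder())
print(wav.shape)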

Models:

1. Base: a front-end with the original FastSpeech 2 (Mel + MB-MelGAN).
2. w/ BERT: a BERT-based front-end with the original FastSpeech 2 (Mel + MB-MelGAN).
3. w/ BERT-CWS+POS: a fine-tuned BERT-based front-end with the original FastSpeech 2 (Mel + MB-MelGAN). BERT-CWS+POS means the BERT encoder with a prediction layer for the joint CWS and POS tagging task.
4. w/ BERT-Prosody: a fine-tuned BERT-based front-end with the original FastSpeech 2 (Mel + MB-MelGAN). BERT-Prosody means the BERT encoder with a prediction layer for the prosody structure prediction task.
5. w/ BERT-Multi: a fine-tuned BERT-based front-end with the original FastSpeech 2 (Mel + MB-MelGAN). BERT-Multi means the BERT encoder with three prediction layers (polyphone disambiguation, joint CWS and POS tagging, prosody structure prediction) attached in parallel, with the BERT embeddings fine-tuned in a multi-task learning framework (see the sketch after this list).
6. w/ clean: a front-end with a FastSpeech2-based acoustic model pre-trained on clean data (Mel + MB-MelGAN).
7. w/ noisy: a front-end with a FastSpeech2-based acoustic model pre-trained on noisy data (Mel + MB-MelGAN).
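
As referenced in item 5, the following is a minimal sketch of how a shared BERT encoder with three parallel token-level prediction layers could be fine-tuned with a combined multi-task loss. The label counts, the loss weights, and the choice of the bert-base-chinese checkpoint are assumptions for illustration; the paper's exact head architectures and task weighting may differ.

# Sketch of a multi-task BERT front-end: one shared encoder, three parallel heads.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast


class MultiTaskBertFrontEnd(nn.Module):
    def __init__(self, n_polyphone=200, n_cws_pos=60, n_prosody=4):  # label counts are assumed
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        hidden = self.bert.config.hidden_size
        # Three token-level prediction layers attached in parallel to the shared encoder.
        self.polyphone_head = nn.Linear(hidden, n_polyphone)
        self.cws_pos_head = nn.Linear(hidden, n_cws_pos)
        self.prosody_head = nn.Linear(hidden, n_prosody)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        return (self.polyphone_head(h),
                self.cws_pos_head(h),
                self.prosody_head(h))


def multitask_loss(logits, labels, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-task cross-entropy losses (equal weights are an assumption)."""
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    return sum(w * ce(l.transpose(1, 2), y)
               for w, l, y in zip(weights, logits, labels))


tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
batch = tokenizer(["商家允诺三天送货到位。"], return_tensors="pt")
model = MultiTaskBertFrontEnd()
logits = model(batch["input_ids"], batch["attention_mask"])

# Dummy labels of the same token length, just to show the combined loss call.
T = batch["input_ids"].shape[1]
labels = [torch.zeros(1, T, dtype=torch.long) for _ in range(3)]
loss = multitask_loss(logits, labels)
loss.backward()   # gradients flow into the shared encoder and all three heads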

All of the phrases below were unseen during training.

1. 但是发型还要改进一点。 (But the hairstyle still needs a little improvement.)

Audio: Base | w/ BERT | w/ BERT-CWS+POS | w/ BERT-Prosody | w/ BERT-Multi | w/ clean | w/ noisy

2. 商家允诺三天送货到位。 (The merchant promised delivery within three days.)

Audio: Base | w/ BERT | w/ BERT-CWS+POS | w/ BERT-Prosody | w/ BERT-Multi | w/ clean | w/ noisy

3. 他们繁育的蔬菜种子远销全国各地。 (The vegetable seeds they breed are sold all over the country.)

Audio: Base | w/ BERT | w/ BERT-CWS+POS | w/ BERT-Prosody | w/ BERT-Multi | w/ clean | w/ noisy

Pairwise comparisons of the different models:

1. Base vs. w/ BERT

Sample 1: Base | w/ BERT
Sample 2: Base | w/ BERT

2. Base vs. w/ BERT-CWS+POS

Sample 1: Base | w/ BERT-CWS+POS
Sample 2: Base | w/ BERT-CWS+POS

3. Base vs. w/ BERT-Prosody

Sample 1: Base | w/ BERT-Prosody
Sample 2: Base | w/ BERT-Prosody

4. Base vs. w/ BERT-Multi

Sample 1: Base | w/ BERT-Multi
Sample 2: Base | w/ BERT-Multi

5. w/ BERT vs. w/ BERT-CWS+POS

Sample 1: w/ BERT | w/ BERT-CWS+POS
Sample 2: w/ BERT | w/ BERT-CWS+POS

6. w/ BERT vs. w/ BERT-Prosody

Sample 1: w/ BERT | w/ BERT-Prosody
Sample 2: w/ BERT | w/ BERT-Prosody

7. w/ BERT vs. w/ BERT-Multi

Sample 1: w/ BERT | w/ BERT-Multi
Sample 2: w/ BERT | w/ BERT-Multi

8. Base vs. w/ clean

Sample 1: Base | w/ clean
Sample 2: Base | w/ clean

9. Base vs. w/ noisy

Sample 1: Base | w/ noisy
Sample 2: Base | w/ noisy

10. w/ clean vs. w/ noisy

Sample 1: w/ clean | w/ noisy
Sample 2: w/ clean | w/ noisy