Authors: Zhu Li, Yuqing Zhang, Mengxi Nie, Ming Yan, Mengnan He, Ruixiong Zhang, Caixia Gong
Abstract: Recent advancements in end-to-end speech synthesis have made it possible to generate highly natural speech.
However, training these models typically requires a large amount of high-fidelity speech data,
and for unseen texts, the prosody of synthesized speech is relatively unnatural.
To address these issues, we propose to combine a fine-tuned BERT-based front-end
with a pre-trained FastSpeech2-based acoustic model to improve prosody modeling.
The pre-trained BERT is fine-tuned on the polyphone disambiguation task,
the joint Chinese word segmentation (CWS) and part-of-speech (POS) tagging task,
and the prosody structure prediction (PSP) task in a multi-task learning framework.
FastSpeech 2 is pre-trained on large-scale external data that are noisy but easier to obtain.
Experimental results show that both the fine-tuned BERT model and the pre-trained FastSpeech 2 improve prosody,
especially for structurally complex sentences.
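The multi-task fine-tuning described above can be pictured as one shared BERT encoder with three task-specific prediction layers attached in parallel, trained with a joint loss. The sketch below is a minimal PyTorch/HuggingFace illustration, not the authors' implementation; the checkpoint name, label-set sizes, and equal loss weighting are assumptions made for illustration only.

```python
# Minimal sketch (not the authors' code): a shared BERT encoder with three
# parallel token-level prediction heads, fine-tuned jointly.
# Label-set sizes and the unweighted loss sum are illustrative assumptions.
import torch.nn as nn
from transformers import BertModel

class MultiTaskBertFrontEnd(nn.Module):
    def __init__(self,
                 pretrained="bert-base-chinese",  # assumed Chinese BERT checkpoint
                 n_polyphone=300,                 # assumed number of pinyin classes
                 n_cws_pos=60,                    # assumed joint CWS+POS tag set size
                 n_prosody=4):                    # e.g. none / PW / PPH / IPH boundary
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        hidden = self.bert.config.hidden_size
        # Three prediction layers attached in parallel on top of the encoder.
        self.polyphone_head = nn.Linear(hidden, n_polyphone)
        self.cws_pos_head = nn.Linear(hidden, n_cws_pos)
        self.prosody_head = nn.Linear(hidden, n_prosody)
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, input_ids, attention_mask,
                polyphone_labels=None, cws_pos_labels=None, prosody_labels=None):
        # Shared contextual embeddings from BERT: (batch, seq_len, hidden).
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        logits = {
            "polyphone": self.polyphone_head(h),
            "cws_pos": self.cws_pos_head(h),
            "prosody": self.prosody_head(h),
        }
        loss = None
        if polyphone_labels is not None:
            # Sum the three task losses; the real weighting scheme may differ.
            loss = sum(
                self.loss_fn(l.transpose(1, 2), y)   # (B, C, T) vs. (B, T)
                for l, y in [(logits["polyphone"], polyphone_labels),
                             (logits["cws_pos"], cws_pos_labels),
                             (logits["prosody"], prosody_labels)]
            )
        return logits, loss
```

After fine-tuning, the prosody and polyphone predictions from these heads would feed the text front-end that prepares inputs for the acoustic model.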
Models:
1. Base: a front-end with the original FastSpeech 2 (Mel + MB-MelGAN).
2. w/ BERT: a BERT-based front-end with the original FastSpeech 2 (Mel + MB-MelGAN).
3. w/ BERT-CWS+POS: a fine-tuned BERT-based front-end with the original FastSpeech 2 (Mel + MB-MelGAN). BERT-CWS+POS means a BERT encoder with a prediction layer for the joint CWS and POS tagging task.
4. w/ BERT-Prosody: a fine-tuned BERT-based front-end with the original FastSpeech 2 (Mel + MB-MelGAN). BERT-Prosody means a BERT encoder with a prediction layer for the prosody structure prediction task.
5. w/ BERT-Multi: a fine-tuned BERT-based front-end with the original FastSpeech 2 (Mel + MB-MelGAN). BERT-Multi means a BERT encoder with three prediction layers (polyphone disambiguation, joint CWS and POS tagging, and prosody structure prediction) attached in parallel, with the BERT embeddings fine-tuned in a multi-task learning framework, as sketched above.
6. w/ clean: a front-end with a FastSpeech2-based acoustic model pre-trained on clean data (Mel + MB-MelGAN).
7. w/ noisy: a front-end with a FastSpeech2-based acoustic model pre-trained on noisy data (Mel + MB-MelGAN).
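Models 6 and 7 differ only in the data used to pre-train the acoustic model. Assuming the usual transfer-learning recipe (pre-train on a large external corpus, then continue training on the target recordings), the schedule might look like the rough sketch below. The placeholder module, data loaders, learning rates, and step counts are stand-ins; the actual system uses FastSpeech 2 with Mel-spectrogram targets and an MB-MelGAN vocoder.

```python
# Rough sketch of a pre-train / fine-tune schedule for the acoustic model.
# `ToyAcousticModel` and the loaders are stand-ins, not the paper's implementation.
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Placeholder for a FastSpeech2-like phoneme-to-mel model."""
    def __init__(self, n_phones=100, n_mels=80, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_phones, hidden)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, phone_ids):
        return self.proj(self.embed(phone_ids))

def run_stage(model, loader, lr, steps):
    """One training stage over (phone_ids, mel_target) batches."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    it = iter(loader)
    for _ in range(steps):
        try:
            phone_ids, mel_target = next(it)
        except StopIteration:
            it = iter(loader)
            phone_ids, mel_target = next(it)
        loss = nn.functional.l1_loss(model(phone_ids), mel_target)
        opt.zero_grad()
        loss.backward()
        opt.step()

model = ToyAcousticModel()
# Stage 1: pre-train on the large external corpus
# (clean external data for model 6, noisy external data for model 7).
# run_stage(model, external_loader, lr=1e-4, steps=200_000)   # hypothetical loader/steps
# Stage 2: continue training on the target speaker's recordings,
# typically with a lower learning rate.
# run_stage(model, target_loader, lr=1e-5, steps=50_000)      # hypothetical loader/steps
```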
All of the phrases below are unseen during training.