Abstract: End-to-end text-to-speech (TTS) can synthesize monolingual speech with high naturalness and intelligibility. Recently, end-to-end models have also been applied to code-switching (CS) TTS, where they perform well in naturalness, intelligibility, and speaker consistency. However, existing systems rely on skilled bilingual speakers to build a CS mix-lingual data set with a high Language-Mix-Ratio (LMR), while simply mixing monolingual data sets results in accent problems. To reduce the cost of recording and maintain speaker consistency, in this paper we investigate an effective method for using a low-LMR, imbalanced mix-lingual data set. Experiments show that it is possible to construct a CS TTS system from a low-LMR, imbalanced mix-lingual data set with diverse input text representations while producing acceptable synthetic CS speech with a Mean Opinion Score (MOS) above 4.0. We also find that results improve when the mix-lingual data set is augmented with monolingual English data.
Text Representations
PY-AP: Tonal pinyin for Mandarin and alphabet letters for English.
PY-UP: Tonal pinyin for Mandarin and uppercase letters for English.
PY-PY: Tonal pinyin for both Mandarin and English.
PY-PH: Tonal pinyin for Mandarin and CMU phonemes for English.
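The four representations above can be illustrated with a minimal sketch. It is not the paper's front end. It assumes the input is already segmented by language, that Mandarin tokens arrive as tonal pinyin syllables, and that English is fed letter by letter for PY-AP and PY-UP. The function name `encode`, the `(lang, token)` segment format, and both toy lexicons are hypothetical. In particular, the English-to-pinyin entry is an invented placeholder, since the source does not give that mapping.

```python
# Toy lexicons for illustration only (hypothetical names, not from the paper).
# The ARPAbet entry follows CMU Pronouncing Dictionary conventions.
TOY_CMU = {"apple": ["AE1", "P", "AH0", "L"]}
# Invented placeholder: the actual English-to-pinyin mapping is not specified.
TOY_EN_TO_PINYIN = {"apple": ["ai1", "pou4"]}

SCHEMES = ("PY-AP", "PY-UP", "PY-PY", "PY-PH")


def encode(segments, scheme):
    """Turn (lang, token) pairs into a flat list of input symbols.

    lang is "zh" (token is one tonal pinyin syllable) or "en"
    (token is one English word).
    """
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme: {scheme}")
    symbols = []
    for lang, token in segments:
        if lang == "zh":
            # Mandarin is tonal pinyin in every scheme.
            symbols.append(token)
        elif scheme == "PY-AP":
            # Assumption: one lowercase letter per symbol.
            symbols.extend(token.lower())
        elif scheme == "PY-UP":
            # Uppercase letters keep English symbols distinct from pinyin letters.
            symbols.extend(token.upper())
        elif scheme == "PY-PY":
            symbols.extend(TOY_EN_TO_PINYIN[token.lower()])
        else:  # PY-PH
            symbols.extend(TOY_CMU[token.lower()])
    return symbols


if __name__ == "__main__":
    sentence = [("zh", "wo3"), ("zh", "xi3"), ("zh", "huan1"), ("en", "Apple")]
    for scheme in SCHEMES:
        print(scheme, encode(sentence, scheme))
```

A real system would replace the toy lexicons with a full pronunciation dictionary and a grapheme-to-phoneme fallback for out-of-vocabulary words.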