Multi-Speaker Text-to-Speech Using ForwardTacotron with Improved Duration Prediction

Conference: Speech Communication - 15th ITG Conference
09/20/2023 - 09/22/2023 at Aachen

doi:10.30420/456164036

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Kayyar Lakshminarayana, Kishor; Dittmar, Christian; Pia, Nicola (Fraunhofer Institute for Integrated Circuits (IIS), Erlangen, Germany)
Habets, Emanuel A.P. (International Audio Laboratories Erlangen, Erlangen, Germany)

Abstract:
Several non-autoregressive methods for fast and efficient text-to-speech synthesis have been proposed. Most of these use a duration predictor to estimate the temporal sequence of phonemes in the speech. This duration prediction is based on the input phoneme sequence in a speaker-independent fashion. The resulting constant speech pace across speakers is unnatural, since every human has a unique, characteristic speaking rate. This paper proposes an extension of the multi-speaker ForwardTacotron that learns this aspect with trainable speaker embeddings. The durations of speech synthesized by the proposed model across multiple speakers are much closer to those of speech synthesized by a baseline auto-regressive model. The proposed extension yields marginal improvements in intelligibility, as measured with an automated semantically unpredictable sentence test. Further, a listening test shows that speech rhythm does not play a significant role in perceptual quality assessment.
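The core idea sketched below is speaker-conditioned duration prediction as used in non-autoregressive TTS: a learned speaker embedding is concatenated to each phoneme representation before the duration predictor, and the predicted per-phoneme durations drive a length regulator that expands the phoneme sequence to frame rate. This is a minimal NumPy illustration of the mechanism only; all names, dimensions, and the linear regressor are assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_durations(phoneme_emb, speaker_emb, W, b):
    """Predict one duration (in frames) per phoneme, conditioning on the
    speaker by concatenating the speaker embedding to every phoneme vector.
    (A linear regressor stands in for the paper's duration predictor.)"""
    n_phonemes = phoneme_emb.shape[0]
    spk = np.tile(speaker_emb, (n_phonemes, 1))       # broadcast speaker info
    x = np.concatenate([phoneme_emb, spk], axis=1)    # (n_phonemes, d_p + d_s)
    log_dur = x @ W + b                               # predict log-durations
    return np.maximum(1, np.round(np.exp(log_dur))).astype(int)

def length_regulate(phoneme_emb, durations):
    """Expand each phoneme vector by its predicted duration: the
    'length regulator' used by non-autoregressive TTS models."""
    return np.repeat(phoneme_emb, durations, axis=0)

# Toy dimensions (assumed for illustration).
d_p, d_s, n = 8, 4, 5
phonemes = rng.standard_normal((n, d_p))
speaker = rng.standard_normal(d_s)                    # trainable in practice
W = rng.standard_normal((d_p + d_s, 1)) * 0.1
b = np.zeros(1)

durs = predict_durations(phonemes, speaker, W, b).ravel()
frames = length_regulate(phonemes, durs)
assert frames.shape == (durs.sum(), d_p)
```

Because the speaker embedding enters the duration predictor, two speakers with different embeddings can receive different durations for the same phoneme sequence, which is how a per-speaker speaking rate can be learned.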