Improving the Naturalness of Synthesized Spectrograms for TTS Using GAN-Based Post-Processing
Conference: Speech Communication - 15th ITG Conference
09/20/2023 - 09/22/2023 at Aachen
doi:10.30420/456164053
Proceedings: ITG-Fb. 312: Speech Communication
Pages: 5
Language: English
Type: PDF
Authors:
Sani, Paolo; Bauer, Judith; Zalkow, Frank; Dittmar, Christian (Fraunhofer IIS, Erlangen, Germany)
Habets, Emanuel A. P. (Fraunhofer IIS, Erlangen, Germany & International Audio Laboratories Erlangen, Germany)
Abstract:
Recent text-to-speech (TTS) architectures usually synthesize speech in two stages. First, an acoustic model predicts a compressed spectrogram from text input. Second, a neural vocoder converts the spectrogram into a time-domain audio signal. However, the synthesized spectrograms often differ substantially from real-world spectrograms. In particular, they lack fine-grained details, which is referred to as the "over-smoothing effect." Consequently, the audio signals generated by the vocoder may contain audible artifacts. We propose a spectrogram post-processing model based on generative adversarial networks (GANs) to improve the naturalness of synthesized spectrograms. In our experiments, we use acoustic models of varying quality (yielding different degrees of artifacts) and conduct listening tests, which show that our approach can substantially improve the naturalness of synthesized spectrograms. This improvement is especially significant for highly degraded spectrograms, which lack fine-grained details or harmonic content.
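To make the pipeline described in the abstract concrete, the following is a minimal NumPy sketch of the two-stage TTS architecture with a post-processing step inserted between the acoustic model and the vocoder. All three components are simplistic stand-ins, not the paper's actual networks: the "acoustic model" produces an artificially over-smoothed mel spectrogram via a moving average, the "GAN post-processor" merely adds a random residual where a trained generator would predict fine-grained detail, and the "vocoder" returns placeholder audio. The mel dimensions (80 bands) and hop length (256 samples) are common defaults assumed here, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MELS, N_FRAMES = 80, 100  # assumed mel-spectrogram dimensions
HOP = 256                   # assumed vocoder hop length (samples per frame)

def acoustic_model(text: str) -> np.ndarray:
    """Stage 1 stand-in: predict a mel spectrogram from text.

    A moving average over time mimics the over-smoothing effect:
    fine-grained frame-to-frame variation is lost.
    """
    mel = rng.standard_normal((N_MELS, N_FRAMES))
    kernel = np.ones(5) / 5
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, mel
    )

def gan_postprocessor(mel: np.ndarray) -> np.ndarray:
    """Proposed post-processing stand-in: restore fine-grained detail.

    A trained GAN generator would predict this residual adversarially;
    here it is random noise, purely for illustration.
    """
    return mel + 0.1 * rng.standard_normal(mel.shape)

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: convert the mel spectrogram to a waveform."""
    return rng.standard_normal(mel.shape[1] * HOP)

smoothed = acoustic_model("hello world")   # over-smoothed spectrogram
refined = gan_postprocessor(smoothed)      # post-processed spectrogram
audio = vocoder(refined)                   # time-domain signal

# The refined spectrogram has more frame-to-frame variation than the
# over-smoothed one, the kind of detail the real model aims to restore.
assert np.diff(refined, axis=1).std() > np.diff(smoothed, axis=1).std()
```

In the paper's setting, the post-processor is trained adversarially so that a discriminator cannot distinguish refined spectrograms from real-world ones; the sketch only fixes the interface, i.e. where such a model slots into the two-stage pipeline.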