U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech

Conference: Speech Communication - 15th ITG Conference
20.09.2023-22.09.2023 in Aachen

doi:10.30420/456164010

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Jing, Xin; Yang, Zijiang; Triantafyllopoulos, Andreas (Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, Germany)
Chang, Yi; Schuller, Bjoern W. (Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, Germany & GLAM – Group on Language, Audio, & Music, Imperial College London, UK)
Xie, Jiangjian (School of Technology, Beijing Forestry University, China)

Abstract:
Recently, Score-based Generative Models (SGMs), also known as Diffusion Probabilistic Models (DPMs), have gained traction in neural speech synthesis due to their ability to produce high-quality synthesized speech. In SGMs, the U-Net architecture and its variants have long dominated as the backbone since their first successful adoption. In this research, we propose the U-DiT architecture, exploring the potential of the Vision Transformer (ViT) architecture as the core component of the diffusion model in a TTS system. The proposed U-DiT TTS system, which inherits the best parts of U-Net and ViT, allows for great scalability and versatility across different data scales and utilizes a pretrained HiFi-GAN as the vocoder. The objective (i.e., Fréchet distance) and MOS results demonstrate that our U-DiT TTS system achieves competitive performance on the single-speaker dataset LJSpeech. Our demos are publicly available at: https://eihw.github.io/u-dit-tts/
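To make the architectural idea concrete, below is a minimal, hypothetical PyTorch sketch of a U-shaped transformer denoiser in the spirit of U-DiT: an encoder stage of transformer blocks, token downsampling, a transformer bottleneck, token upsampling, and a U-Net-style skip connection. All module names, dimensions, and the exact layout are illustrative assumptions, not the authors' implementation, which operates within a full diffusion-based TTS pipeline with text conditioning and a HiFi-GAN vocoder.

```python
# Illustrative sketch only: a U-shaped transformer denoiser over mel-spectrogram
# frames. Dimensions and module layout are assumptions for demonstration.
import torch
import torch.nn as nn


class TransformerStage(nn.Module):
    """A stack of standard Transformer encoder blocks over frame tokens."""
    def __init__(self, dim, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):  # x: (batch, tokens, dim)
        return self.blocks(x)


class UDiTSketch(nn.Module):
    """U-shaped denoiser: encode, downsample tokens, transform at the
    bottleneck, upsample, and decode with a skip connection."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.embed = nn.Linear(n_mels, dim)  # mel frames -> tokens
        self.time_mlp = nn.Sequential(       # embed the diffusion time step
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.enc = TransformerStage(dim)
        self.down = nn.Conv1d(dim, dim, kernel_size=2, stride=2)        # halve token count
        self.mid = TransformerStage(dim)
        self.up = nn.ConvTranspose1d(dim, dim, kernel_size=2, stride=2)  # restore token count
        self.dec = TransformerStage(dim)
        self.out = nn.Linear(dim, n_mels)     # predict the noise / score

    def forward(self, mel, t):
        # mel: (batch, frames, n_mels); t: (batch,) diffusion time in [0, 1]
        h = self.embed(mel) + self.time_mlp(t[:, None])[:, None, :]
        skip = self.enc(h)
        h = self.down(skip.transpose(1, 2)).transpose(1, 2)
        h = self.mid(h)
        h = self.up(h.transpose(1, 2)).transpose(1, 2)
        h = self.dec(h + skip)                # U-Net-style skip connection
        return self.out(h)


if __name__ == "__main__":
    model = UDiTSketch()
    mel = torch.randn(2, 128, 80)             # dummy mel-spectrogram batch
    t = torch.rand(2)
    print(model(mel, t).shape)                 # torch.Size([2, 128, 80])
```

In a diffusion TTS setting, a network of this shape would be trained to predict the noise (or score) added to mel-spectrograms, and the generated mel-spectrogram would then be converted to waveform audio by a pretrained vocoder such as HiFi-GAN.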