In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised Representations and Neural Vocoder-based Resynthesis

Conference: Speech Communication - 15th ITG Conference
20.09.2023-22.09.2023 in Aachen

doi:10.30420/456164034

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Raj Prabhu, Navin (Signal Processing, Universität Hamburg, Germany & Industrial and Organizational Psychology, Universität Hamburg, Germany)
Lehmann-Willenbrock, Nale (Industrial and Organizational Psychology, Universität Hamburg, Germany)
Gerkmann, Timo (Signal Processing, Universität Hamburg, Germany)

Abstract:
Speech emotion conversion aims to convert the expressed emotion of a spoken utterance to a target emotion while preserving the lexical information and the speaker’s identity. In this work, we focus on in-the-wild emotion conversion, where parallel data does not exist and the problem of disentangling lexical, speaker, and emotion information arises. We introduce a methodology that uses self-supervised networks to disentangle the lexical, speaker, and emotional content of an utterance, and subsequently uses a HiFiGAN vocoder to resynthesise the disentangled representations into a speech signal of the target emotion. To obtain a finer-grained emotion representation and to enable emotion intensity control, we operate on the continuous arousal dimension rather than performing emotion conversion on categorical emotion labels. We evaluate our methodology on the large in-the-wild MSP-Podcast dataset. Results reveal that the proposed approach is effectively conditioned on the emotional content of the input speech and is capable of synthesising natural-sounding speech for a target emotion. Results further reveal that the methodology synthesises speech better for mid-scale arousal (2 to 6) than for extreme arousal (1 and 7).
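The overall pipeline, disentangling lexical, speaker, and arousal content and then resynthesising a waveform from the combined representations, can be illustrated with a minimal sketch. The following PyTorch code is a hypothetical illustration, not the authors' implementation: the module names, dimensions, and the simplified two-layer generator merely stand in for the actual self-supervised encoders and the HiFiGAN vocoder, and the target arousal is assumed to lie on the dataset's 1-7 scale.

import torch
import torch.nn as nn


class EmotionConversionSketch(nn.Module):
    """Toy pipeline: combine disentangled representations, then resynthesise."""

    def __init__(self, n_units: int = 100, d_model: int = 256):
        super().__init__()
        # Lexical content: discrete units from a self-supervised model
        # (e.g. clustered SSL features), embedded into a continuous sequence.
        self.content_embedding = nn.Embedding(n_units, d_model)
        # Speaker identity: one fixed-size embedding per utterance.
        self.speaker_proj = nn.Linear(d_model, d_model)
        # Target arousal: a single continuous value, projected and broadcast
        # over time so it conditions every frame of the synthesis.
        self.arousal_proj = nn.Linear(1, d_model)
        # Stand-in for a HiFiGAN-style generator mapping the conditioned
        # representation back to a waveform (upsampling stages omitted).
        self.generator = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=7, padding=3),
            nn.LeakyReLU(0.1),
            nn.Conv1d(d_model, 1, kernel_size=7, padding=3),
        )

    def forward(self, content_units, speaker_emb, target_arousal):
        # content_units: (batch, frames) int64; speaker_emb: (batch, d_model);
        # target_arousal: (batch, 1) float, assumed on the 1-7 arousal scale.
        x = self.content_embedding(content_units)               # (B, T, D)
        x = x + self.speaker_proj(speaker_emb).unsqueeze(1)     # add speaker
        x = x + self.arousal_proj(target_arousal).unsqueeze(1)  # add arousal
        return self.generator(x.transpose(1, 2)).squeeze(1)     # (B, T)


if __name__ == "__main__":
    model = EmotionConversionSketch()
    units = torch.randint(0, 100, (1, 50))   # hypothetical SSL unit sequence
    speaker = torch.randn(1, 256)            # hypothetical speaker embedding
    arousal = torch.tensor([[6.0]])          # high target arousal (1-7 scale)
    waveform = model(units, speaker, arousal)
    print(waveform.shape)                    # torch.Size([1, 50])

Conditioning on a continuous arousal scalar, rather than a categorical emotion label, is what makes intensity control possible in such a design: sweeping the conditioning value from 2 to 6 would shift the expressed arousal of the resynthesised speech along the scale.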