Stream-ETS: Low-latency End-to-end Speech Synthesis from Electromyography Signals
Conference: Speech Communication - 15th ITG Conference
09/20/2023 - 09/22/2023 at Aachen
doi:10.30420/456164039
Proceedings: ITG-Fb. 312: Speech Communication
Pages: 5
Language: English
Type: PDF
Authors:
Scheck, Kevin; Ivucic, Darius; Ren, Zhao; Schultz, Tanja (Cognitive Systems Lab, University of Bremen, Germany)
Abstract:
The electromyographic activity of articulatory muscles provides information about the speech production process. Electromyography (EMG) signals are therefore investigated as a basis for speech communication without acoustic speech, in the context of Silent Speech Interfaces. To this end, EMG-to-Speech (ETS) models predict acoustic speech from EMG signals captured during articulation. In this work, we propose Stream-ETS, a streamable end-to-end ETS system. Its architecture consists of a causal EMG encoder, which maps EMG signals to Mel-spectrograms, and a causal neural vocoder, which predicts the acoustic speech signal from them. On a GPU, Stream-ETS converts 10-millisecond chunks of EMG into acoustic speech in approximately 8 milliseconds, so the system runs in real time with low latency. We first pre-train both components and then fine-tune them end-to-end. Experiments indicate that end-to-end training increases the naturalness of the synthesized speech.
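The real-time claim can be checked with simple arithmetic. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and the latency model (buffer one full chunk, then process it) is a simplifying assumption that ignores recording and playback hardware delays. Only the 10 ms chunk size and the approximate 8 ms processing time come from the abstract.

```python
def real_time_factor(processing_ms: float, chunk_ms: float) -> float:
    """Compute time divided by the signal duration covered by one chunk.

    A value below 1.0 means each chunk is processed faster than the next
    one arrives, so the system keeps up with the input stream.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return processing_ms / chunk_ms


def chunk_latency_ms(processing_ms: float, chunk_ms: float) -> float:
    """Assumed latency model: wait for one full chunk, then process it."""
    return chunk_ms + processing_ms


# Figures reported in the abstract (processing time is approximate).
CHUNK_MS = 10.0
PROCESSING_MS = 8.0

rtf = real_time_factor(PROCESSING_MS, CHUNK_MS)
latency = chunk_latency_ms(PROCESSING_MS, CHUNK_MS)

print(f"real-time factor: {rtf:.2f}")
print(f"latency under the assumed model: {latency:.0f} ms")
```

Under these assumptions the real-time factor is below 1, which matches the abstract's statement that the system performs in real time.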