Analyzing And Improving Neural Speaker Embeddings for ASR

Conference: Speech Communication - 15th ITG Conference
20.09.2023-22.09.2023 in Aachen

doi:10.30420/456164040

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Luescher, Christoph; Xu, Jingjing; Zeineldeen, Mohammad; Schlueter, Ralf; Ney, Hermann (Machine Learning and Human Language Technology, RWTH Aachen University, Aachen, Germany & AppTek GmbH, Aachen, Germany)

Abstract:
Neural speaker embeddings encode a speaker’s speech characteristics through a DNN model and are prevalent in speaker verification tasks. However, only a few studies, with inconclusive results, have investigated the use of neural speaker embeddings in an ASR system. In this work, we present our efforts on integrating neural speaker embeddings into a Conformer-based hybrid HMM ASR system. For ASR, our improved embedding extraction pipeline, in combination with the Weighted-Simple-Add integration method, results in x-vectors and c-vectors performing on par with i-vectors. We further analyze, compare, and combine different speaker embeddings. We improve our already strong baseline by switching to a one-cycle learning rate schedule, which also reduces the training time. By further adding neural speaker embeddings, we gain additional improvements. Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5’00 and Hub5’01 while training only on SWB 300h.