Two-Dimensional Embeddings for Low-Resource Keyword Spotting Based on Dynamic Time Warping
Conference: Speech Communication - 14th ITG Conference
29 September 2021 - 1 October 2021, online
Proceedings: ITG-Fb. 298: Speech Communication
Pages: 5
Language: English
Type: PDF
Authors:
Wilkinghoff, Kevin; Cornaggia-Urrigshardt, Alessia; Goekgoez, Fahrettin (Fraunhofer Institute for Communication, Information Processing and Ergonomics FKIE, Wachtberg, Germany)
Abstract:
State-of-the-art keyword spotting systems consist of neural networks trained either as classifiers or to extract discriminative representations, so-called embeddings. However, training such a system requires a sufficient amount of labeled data. Dynamic time warping is another keyword spotting approach: it uses only a single sample of each keyword as the pattern to be searched for and thus does not require any training. In this work, we propose to combine the strengths of both approaches in two ways. First, an angular margin loss for training a neural network to extract two-dimensional embeddings is presented. It is shown that these embeddings can be used as features for dynamic time warping, outperforming cepstral features even when very few training samples are available. Second, dynamic time warping is applied to cepstral features to turn weak labels into strong labels and thus provide more labeled training data for the two-dimensional embeddings.
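To illustrate how a single keyword sample can be searched for in a longer recording without any training, the following is a minimal sketch of subsequence dynamic time warping. The abstract does not state which DTW variant, local distance, or step pattern the authors use, so the choices here (free start and end points in the signal, Euclidean distance, the standard three-step pattern) are assumptions, and the function name `subsequence_dtw` is hypothetical. The frames are plain tuples, so the same code applies to two-dimensional embeddings or to cepstral feature vectors.

```python
import math


def subsequence_dtw(query, signal):
    """Find the best match of `query` inside `signal`.

    Both arguments are non-empty sequences of feature vectors of equal
    dimension. Returns (cost, end_index), where `end_index` is the index
    of the signal frame at which the best-matching segment ends.
    Assumption: Euclidean local distance and the standard three-step
    pattern; the paper may use a different configuration.
    """
    n, m = len(query), len(signal)
    if n == 0 or m == 0:
        raise ValueError("query and signal must be non-empty")

    # Row 0 is all zeros so that a match may start at any signal frame.
    prev = [0.0] * (m + 1)
    for i in range(1, n + 1):
        # Column 0 stays infinite: query frames cannot match an empty segment.
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            local = math.dist(query[i - 1], signal[j - 1])
            curr[j] = local + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr

    # The match may end at any signal frame; pick the cheapest end point.
    end = min(range(1, m + 1), key=lambda j: prev[j])
    return prev[end], end - 1


# Toy example: the keyword template appears, time-stretched, inside the signal.
template = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
recording = [(5.0, 5.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (9.0, 9.0)]
cost, end_index = subsequence_dtw(template, recording)
print(cost, end_index)
```

In a keyword spotting setting, a detection would typically be declared when the cost falls below a threshold; the same alignment end points could also serve to localize a keyword within a weakly labeled recording, which is the idea behind turning weak labels into strong ones.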