Missing Feature Audiovisual Speech Recognition under Real-Time Constraints

Conference: Sprachkommunikation 2010 - 9. ITG-Fachtagung
October 6-8, 2010, Bochum, Germany

Proceedings: Sprachkommunikation 2010

Pages: 4
Language: English
Type: PDF

Authors:
Kolossa, Dorothea; Astudillo, Ramón Fernandez; Zeiler, Steffen; Vorwerk, Alexander; Lerch, Dennis; Chong, Jike; Orglmeister, Reinhold

Affiliations:
Electronics and Medical Signal Processing Group, TU Berlin, 10587 Berlin, Germany
Parallel Computing Laboratory, UC Berkeley, CA 94704, USA

Abstract:
Speech recognition under very noisy conditions can profit greatly from the addition of a second modality. This is of interest for a wide range of applications; we focus on command-and-control tasks, where only a small vocabulary is needed but it must be recognized correctly even at negative signal-to-noise ratios (SNRs). For this purpose, a robust audio front-end and a video front-end have been extended to provide real-valued and binary estimates of feature reliability, respectively. This information has been integrated into an audiovisual recognizer that performs robustly, without any additional parameter tuning, over a wide range of SNRs, speakers, and noise conditions. To make the method extensible to larger vocabularies and more complex models, the likelihood computation has been parallelized, giving a measured speedup of 7.5x on an NVIDIA GT 285 processor compared with a sequential version running on a Core i7 processor at 2.67 GHz.
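
The abstract does not say how the reliability estimates enter the likelihood computation, so the following is only an illustrative sketch of two standard missing-feature techniques, not the authors' implementation. It assumes a diagonal-covariance Gaussian observation model, treats binary reliability by marginalization (unreliable components are integrated out and contribute nothing to the log-likelihood), and treats real-valued reliability as a per-component observation variance added to the model variance. The function name `log_gauss_diag` and its parameters `reliable` and `obs_var` are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var, reliable=None, obs_var=None):
    """Log-likelihood of feature vector x under a diagonal Gaussian.

    reliable: optional list of booleans (binary reliability); components
              flagged False are marginalized out (assumption: unbounded
              marginalization, so they contribute log(1) = 0).
    obs_var:  optional list of non-negative observation variances
              (assumed form of a real-valued reliability estimate);
              each is added to the corresponding model variance.
    """
    n = len(x)
    if reliable is None:
        reliable = [True] * n
    if obs_var is None:
        obs_var = [0.0] * n

    total = 0.0
    for xi, mi, vi, ri, ui in zip(x, mean, var, reliable, obs_var):
        if not ri:
            continue  # unreliable component: integrated out
        v = vi + ui  # model variance inflated by observation uncertainty
        total += -0.5 * (math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v)
    return total

# Hypothetical frame: the second component is flagged unreliable, so its
# large deviation from the mean does not penalize the score.
score = log_gauss_diag([0.2, 9.0], [0.0, 0.0], [1.0, 1.0],
                       reliable=[True, False])
```

In a sketch like this, the score for each model state depends only on that state's parameters and the current frame, so the per-state evaluations are independent of one another. That independence is the kind of structure a parallel likelihood computation can exploit.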