Comparison of Different Neural Network Architectures for Spoken Language Identification

Conference: Speech Communication - 15th ITG Conference
09/20/2023 - 09/22/2023 in Aachen

doi:10.30420/456164014

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Bazazo, Tala (Human Language Technology and Pattern Recognition, RWTH Aachen University, Germany & eBay, Aachen, Germany)
Zeineldeen, Mohammad; Schlueter, Ralf; Ney, Hermann (Human Language Technology and Pattern Recognition, RWTH Aachen University, Germany)
Plahl, Christian (eBay, Aachen, Germany)

Abstract:
This paper compares different neural-network-based architectures on the spoken language identification task. To the best of our knowledge, such a comparison of different models on the same dataset and the same set of languages does not yet exist. We evaluate seven models, including the latest architectures: a ResNet model based on spectral images, a Convolutional Neural Network, a Bi-directional Long Short-Term Memory network, a Convolutional Recurrent Neural Network, Wav2Vec 2.0, a transformer, and a conformer. We also tackle audio with background noise and music by training on data with similar acoustics. Finally, we show that our models generalize well to third-party data.