Comparison of Different Neural Network Architectures for Spoken Language Identification

Conference: Speech Communication - 15th ITG Conference
20.09.2023-22.09.2023 in Aachen

doi:10.30420/456164014

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Bazazo, Tala (Human Language Technology and Pattern Recognition, RWTH Aachen University, Germany & eBay, Aachen, Germany)
Zeineldeen, Mohammad; Schlueter, Ralf; Ney, Hermann (Human Language Technology and Pattern Recognition, RWTH Aachen University, Germany)
Plahl, Christian (eBay, Aachen, Germany)

Abstract:
This paper compares different neural-network-based architectures on the spoken language identification task. To the best of our knowledge, such a comparison of different models on the same dataset and the same set of languages does not yet exist. We evaluate seven models, including the latest architectures: a ResNet model based on spectral images, a Convolutional Neural Network, a Bi-directional Long Short-Term Memory network, a Convolutional Recurrent Neural Network, Wav2Vec 2.0, a transformer, and a conformer. We also tackle audio with background noise and music by training on data with similar acoustics. Finally, we show that our models generalize well to third-party data.