Transfer Learning using Musical/Non-Musical Mixtures for Multi-Instrument Recognition

Conference: Speech Communication - 15th ITG Conference
09/20/2023 - 09/22/2023 at Aachen

doi:10.30420/456164009

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Bradl, Hannes (Joanneum Research Forschungsgesellschaft mbH, Austria)
Huber, Markus (sonible GmbH, Graz, Austria)
Pernkopf, Franz (Christian Doppler Laboratory for Dependable Intelligent Systems in Harsh Environments, Signal Processing and Speech Communication Lab., Graz University of Technology, Austria)

Abstract:
Datasets for most music information retrieval (MIR) tasks tend to be relatively small. In deep learning, however, insufficient training data often leads to poor performance. This problem is typically addressed with transfer learning (TL) and data augmentation. In this work, we compare several such methods for the task of multi-instrument recognition. A convolutional neural network (CNN) identifies eight instrument families and seven specific instruments in polyphonic music recordings. Training is conducted in two phases: after pre-training on a music tagging dataset, the CNN is retrained using multi-track data. Experiments with different TL methods suggest that training the final fully-connected layers from scratch while fine-tuning the convolutional backbone yields the best performance. Two mixing strategies, musical and non-musical mixing, are investigated. A blend of both mixing strategies works best for multi-instrument recognition.
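
The two mixing strategies can be illustrated with a minimal sketch. The abstract does not define them, so the code assumes that musical mixing sums stems taken from the same track and that non-musical mixing sums stems drawn from different tracks. The toy data, the function names, and the `p_musical` blend probability are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical toy multi-track data: track id -> list of (instrument label, stem samples).
TRACKS = {
    "track_a": [("piano", [0.1, 0.2, 0.3]), ("voice", [0.0, -0.1, 0.1])],
    "track_b": [("drums", [0.5, -0.5, 0.5]), ("bass", [0.2, 0.2, 0.2])],
    "track_c": [("guitar", [0.3, 0.0, -0.3]), ("violin", [0.1, 0.1, 0.1])],
}


def mix(stems):
    """Sum stems sample-wise and return (mixture, set of active instrument labels)."""
    labels = {label for label, _ in stems}
    mixture = [sum(samples) for samples in zip(*(audio for _, audio in stems))]
    return mixture, labels


def musical_mix(tracks, rng):
    """Assumed definition: combine all stems that belong to one randomly chosen track."""
    track_id = rng.choice(sorted(tracks))
    return mix(tracks[track_id])


def non_musical_mix(tracks, rng, n_stems=2):
    """Assumed definition: combine one random stem from each of n_stems different tracks."""
    chosen = rng.sample(sorted(tracks), n_stems)
    return mix([rng.choice(tracks[track_id]) for track_id in chosen])


def blended_mix(tracks, rng, p_musical=0.5):
    """Draw a musical mixture with probability p_musical, otherwise a non-musical one."""
    if rng.random() < p_musical:
        return musical_mix(tracks, rng)
    return non_musical_mix(tracks, rng)
```

Calling `blended_mix(TRACKS, random.Random(0))` returns a mixture together with the set of instrument labels active in it, which would serve as the multi-label training target. The blend ratio is a free parameter in this sketch; the abstract does not state which ratio was used.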