Optimal temporal dynamics of MFCCs for low-complexity VAD systems – a case study
Conference: Speech Communication - 13. ITG-Fachtagung Sprachkommunikation
10/10/2018 - 10/12/2018 at Oldenburg, Germany
Proceedings: Speech Communication
Pages: 5
Language: English
Type: PDF
Authors:
Craciun, Alexandra (XMOS Ltd, UK); Baeckstroem, Tom (Aalto University, Department of Signal Processing and Acoustics, Finland)
Abstract:
Recent advances in machine learning strategies for speech classification require increasingly complex classifiers and large numbers of features. For practical application in low-resource systems, such methods use prohibitively large numbers of operations. A better approach involves reducing the features to the fewest, most salient ones, while simplifying the classifier structure to a minimum. The mel-frequency cepstral coefficients (MFCCs) are often used in speech-related classification tasks, which suggests the compressed information therein is highly informative. They are computed by warping the spectral energy to a mel scale, followed by a logarithm and a discrete cosine transformation. To better understand the properties governing such features, we examine different MFCC configurations using a simple neural network classifier for a low-complexity voice activity detector. In particular, we investigate the optimal number of MFCCs, the extent of the required temporal information and the best compression rate for different analysis settings, with varying frequency resolutions.
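The MFCC pipeline described in the abstract (mel warping of the spectral energy, logarithm, discrete cosine transform) can be illustrated with a minimal sketch for a single frame. This is not the authors' implementation. The mel formula, the triangular filter shape, the filter count, the number of coefficients and all function names below are assumptions chosen for illustration, and the abstract does not specify any of them. No analysis window is applied, and the spectrum is computed with a naive DFT to keep the example free of dependencies.

```python
import math

def hz_to_mel(f):
    # Common mel-scale formula (an assumption; the abstract does not name one).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    # Naive O(N^2) DFT for bins 0..N/2; adequate for a short illustrative frame.
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2.0 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2.0 * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(num_filters, num_bins, sample_rate):
    # Triangular filters whose edges are equally spaced on the mel scale.
    max_mel = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(max_mel * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = (sample_rate / 2.0) / (num_bins - 1)
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(num_bins):
            f = k * bin_hz
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def dct2(values, num_coeffs):
    # Unnormalised DCT-II, keeping only the first num_coeffs outputs.
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (j + 0.5) / n) for j, v in enumerate(values))
            for c in range(num_coeffs)]

def mfcc(frame, sample_rate, num_filters=20, num_coeffs=12):
    # Step 1: spectral energy; step 2: mel warping; step 3: log; step 4: DCT.
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(spec), sample_rate)
    # A small floor avoids log(0) for empty bands.
    log_energies = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10) for row in bank]
    return dct2(log_energies, num_coeffs)

if __name__ == "__main__":
    # Hypothetical test signal: a 440 Hz tone sampled at 8 kHz, 256 samples.
    sr = 8000
    frame = [math.sin(2.0 * math.pi * 440.0 * i / sr) for i in range(256)]
    print(mfcc(frame, sr))
```

In this sketch, `num_filters` sets the frequency resolution of the mel warping and `num_coeffs` sets how many cepstral coefficients are kept, which correspond loosely to the analysis settings and the number of MFCCs that the study varies.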