CSANet: Improving convolutional neural networks for CTC-based speech recognition with shrinkage attention

Conference: CIBDA 2022 - 3rd International Conference on Computer Information and Big Data Applications
03/25/2022 - 03/27/2022 at Wuhan, China

Proceedings: CIBDA 2022

Pages: 5
Language: English
Type: PDF

Authors:
Zhu, Xuechao; Gao, Lu; Hao, Bin; Yang, Lidong; Zhang, Fei (Inner Mongolia University of Science and Technology, Baotou, China)

Abstract:
Recently, convolutional neural network (CNN) based models have shown promising results in end-to-end speech recognition tasks, but the feature extraction capacity of CNNs is limited, and they struggle to effectively extract acoustic features that contain different degrees of redundant information. In this paper, we study how to compensate for this shortcoming with a novel CNN-CTC architecture, which we call CSANet. CSANet adds shrinkage attention to remove useless information outside the threshold region, which improves the feature learning ability of the convolution layers. We also present a unit combining adaptive coefficients and soft thresholding to optimize the shrinkage attention so that it effectively learns the global information above and below the threshold region. Notably, the adaptive coefficients and soft thresholds are learned by backpropagation. In addition, we use 1D depthwise separable convolutions to optimize the gated convolutional network and strengthen contextual relevance. The model is trained with the connectionist temporal classification (CTC) loss. All experiments are conducted on the Mandarin Chinese dataset AISHELL-1. The results demonstrate that the proposed model achieves an 8.91% character error rate using a beam search decoder with an external language model.
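
As a rough illustration of the two building blocks the abstract describes, the following PyTorch sketch shows a generic channel-wise shrinkage attention unit (adaptive coefficient plus soft thresholding, both differentiable and hence learnable by backpropagation) and a gated 1D depthwise separable convolution. The class names, layer widths, kernel size, and the exact placement of the threshold are assumptions made for illustration, not the authors' published CSANet implementation.

```python
# Minimal sketch of shrinkage attention and a gated depthwise separable 1D
# convolution, under assumed design choices (not the paper's exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShrinkageAttention(nn.Module):
    """Channel-wise soft thresholding with an adaptive coefficient learned by
    backpropagation (in the spirit of deep residual shrinkage networks)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # adaptive coefficient alpha in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) feature map from a convolutional layer
        abs_mean = x.abs().mean(dim=-1)          # per-channel average magnitude
        alpha = self.fc(abs_mean)                # adaptive coefficient, learned end-to-end
        tau = (alpha * abs_mean).unsqueeze(-1)   # per-channel soft threshold
        # Soft thresholding: shrink activations toward zero and zero out those
        # whose magnitude falls below tau, suppressing redundant information.
        return torch.sign(x) * F.relu(x.abs() - tau)


class GatedDepthwiseSeparableConv1d(nn.Module):
    """Gated 1D convolution whose filters are depthwise separable, used here to
    widen the temporal context at low cost (illustrative configuration)."""

    def __init__(self, channels: int, kernel_size: int = 11):
        super().__init__()
        padding = kernel_size // 2
        # Depthwise + pointwise pair producing both the content and the gate.
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=padding, groups=channels)
        self.pointwise = nn.Conv1d(channels, 2 * channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        content, gate = self.pointwise(self.depthwise(x)).chunk(2, dim=1)
        return content * torch.sigmoid(gate)     # gated linear unit
```

Stacking such blocks and projecting the output to the vocabulary size yields frame-level logits that can be trained with PyTorch's built-in nn.CTCLoss; beam search decoding with an external language model, as reported in the paper, is performed outside the network.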