Uncertainty-Driven Hybrid Fusion for Audio-Visual Phoneme Recognition
Conference: Speech Communication - 15th ITG Conference
09/20/2023 - 09/22/2023 in Aachen
doi:10.30420/456164050
Proceedings: ITG-Fb. 312: Speech Communication
Pages: 5
Language: English
Type: PDF
Authors:
Fang, Huajian; Gerkmann, Timo (Signal Processing, Universität Hamburg, Germany)
Frintrop, Simone (Computer Vision, Universität Hamburg, Germany)
Abstract:
For several speech-processing tasks, complementary features from the visual modality may improve model performance. However, unreliable visual input may provide misleading information, degrading performance to a level that can be even worse than that of methods based solely on the audio modality. In this work, we propose an uncertainty-driven hybrid fusion scheme for audio-visual phoneme recognition that mitigates the impact of an unreliable visual modality. More specifically, we incorporate modality-wise uncertainty into decision-making, enabling the model to adaptively determine whether to combine multiple modalities and the extent to which the decision depends on each modality. Experimental results show that the proposed uncertainty-driven hybrid fusion scheme retains the benefits of multi-modal approaches when visual inputs are clean and informative, while remaining robust to distortions of the visual modality.
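The abstract does not specify how uncertainty is estimated or how the fusion is carried out, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each modality yields a per-frame posterior over phoneme classes, uses normalized Shannon entropy as a stand-in uncertainty measure, and treats the function name `hybrid_fuse` and the threshold `gate` as hypothetical choices made for illustration.

```python
import numpy as np


def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log(p)).sum())


def hybrid_fuse(p_audio, p_visual, gate=0.9):
    """Illustrative uncertainty-weighted fusion of two class posteriors.

    Assumption: uncertainty is the entropy normalized to [0, 1].
    `gate` is a hypothetical threshold on the visual uncertainty
    above which the visual stream is ignored entirely.
    Returns the fused posterior and the weight given to the visual stream.
    """
    p_audio = np.asarray(p_audio, dtype=float)
    p_visual = np.asarray(p_visual, dtype=float)
    max_h = np.log(p_audio.size)  # entropy of the uniform distribution

    u_audio = entropy(p_audio) / max_h
    u_visual = entropy(p_visual) / max_h

    # Decide whether to combine modalities at all.
    if u_visual >= gate:
        return p_audio, 0.0

    # Decide how much each modality contributes: weight by confidence.
    c_audio, c_visual = 1.0 - u_audio, 1.0 - u_visual
    w_visual = c_visual / (c_audio + c_visual)
    fused = (1.0 - w_visual) * p_audio + w_visual * p_visual
    return fused / fused.sum(), w_visual


audio = [0.6, 0.3, 0.1]
fused_bad, w_bad = hybrid_fuse(audio, [1 / 3, 1 / 3, 1 / 3])   # uninformative video
fused_ok, w_ok = hybrid_fuse(audio, [0.05, 0.90, 0.05])        # confident video
```

With an uninformative (uniform) visual posterior the sketch falls back to the audio posterior alone, whereas a confident visual posterior receives the larger share of the weight. A learned uncertainty estimate could replace the entropy heuristic without changing the overall structure.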