Target-Speaker Voice Activity Detection in Multi-Talker Scenarios: An Empirical Study

Conference: Speech Communication - 15th ITG Conference
20.09.2023-22.09.2023 in Aachen

doi:10.30420/456164049

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Aloradi, Ahmad; Elminshawi, Mohamed; Chetupalli, Srikanth Raj; Habets, Emanuel A. P. (International Audio Laboratories Erlangen, Germany)

Abstract:
Target-speaker voice activity detection (TS-VAD) has recently gained increasing attention due to its wide range of applications, e.g., speaker diarization and extraction. TS-VAD is usually studied in conversational speech scenarios, in which the speech of individual speakers is partially or entirely non-overlapping. This potentially restricts the application of TS-VAD systems to less challenging acoustic environments. In this work, we study TS-VAD for fully overlapped speech mixtures. We conduct an ablation study with Personal VAD 2.0 as the baseline to gain a deeper understanding of the choice of TS-VAD components and their effect on detection performance. Our experiments on the WSJ0-2Mix and Libri2Mix datasets show that existing TS-VAD architectures generalize to multi-talker environments involving full speaker overlap. Furthermore, we find that TS-VAD performance is sensitive to the target conditioning and to the method used to fuse it with the voice activity detection network. We identify multiple configurations of target conditioning and fusion methods that outperform the baseline in single- and multi-talker settings.