Evaluation Metrics for Generative Speech Enhancement Methods: Issues and Perspectives
Conference: Speech Communication - 15th ITG Conference
09/20/2023 - 09/22/2023 at Aachen
doi:10.30420/456164052
Proceedings: ITG-Fb. 312: Speech Communication
Pages: 5
Language: English
Type: PDF
Authors:
Pirklbauer, Jan; Sach, Marvin; Fingscheidt, Tim (Institute for Communications Technology, TU Braunschweig, Braunschweig, Germany)
Fluyt, Kristoff; Tirry, Wouter (Goodix Technology (Belgium) BV, Leuven, Belgium)
Wardah, Wafaa; Moeller, Sebastian (Quality & Usability Lab, TU Berlin, Germany)
Abstract:
Generative speech enhancement methods commonly employ components of text-to-speech (TTS) systems to suppress noise and enhance speech quality. They have gained traction recently, as they allow for a clean, virtually noise-free speech estimate. However, they come with unique error types such as mumbled speech and substituted phonemes, which are often not recognized by common non-intrusive speech quality metrics such as NISQA and DNSMOS. Intrusive metrics such as PESQ and STOI, on the other hand, are also not reliable due to their dependence on audio similarity and are therefore rarely adopted in TTS research. In this work, we provide insights into typical issues of instrumental evaluation of generative approaches to speech enhancement. Furthermore, we propose the Levenshtein phoneme distance (LPD), which helps to catch and interpret the unique error types evoked by generative approaches. Finally, we propose best practices for interpreting metrics for generative approaches, pointing out that PESQ is indeed useful for the evaluation of generative speech enhancement in low-SNR conditions, while NISQA and DNSMOS work well in mid- to high-SNR conditions.
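To make the idea behind a Levenshtein distance on phonemes concrete, the following is a minimal sketch of the standard edit distance applied to two phoneme sequences. The abstract does not specify how phonemes are obtained or whether the distance is normalized, so this is not the authors' exact LPD definition. The function name `levenshtein` and the example phoneme labels are illustrative assumptions.

```python
from typing import Sequence


def levenshtein(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Standard Levenshtein distance between two symbol sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`. The symbols are assumed
    to be phoneme labels produced by some phoneme recognizer, which is
    outside the scope of this sketch.
    """
    # prev[j] holds the distance between the first i-1 symbols of ref
    # and the first j symbols of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference symbols
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of r
                    curr[j - 1] + 1,     # insertion of h
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


# Hypothetical phoneme sequences: the enhanced signal substitutes one
# phoneme (P -> B) and drops the final one (CH).
reference_phonemes = ["S", "P", "IY", "CH"]
enhanced_phonemes = ["S", "B", "IY"]
distance = levenshtein(reference_phonemes, enhanced_phonemes)
print(distance)  # 2
```

In this illustration, a substituted phoneme and a dropped phoneme each contribute one edit. Errors of this kind are the ones the abstract says signal-level quality metrics can miss.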