Compression of end-to-end non-autoregressive image-to-speech system for low-resourced devices

Conference: Speech Communication - 15th ITG Conference
09/20/2023 - 09/22/2023 at Aachen

doi:10.30420/456164029

Proceedings: ITG-Fb. 312: Speech Communication

Pages: 5
Language: English
Type: PDF

Authors:
Srinivasagan, Gokul (Saarland University, Saarbrucken, Germany & Intel Corporation, Hillsboro, Oregon, USA)
Deisher, Michael (Intel Corporation, Hillsboro, Oregon, USA)
Georges, Munir (Intel Labs, Munich, Germany & Technische Hochschule Ingolstadt, Germany)

Abstract:
People with visual impairments have difficulty accessing touchscreen-enabled personal computing devices such as mobile phones and laptops. Image-to-speech (ITS) systems can help mitigate this problem, but their large model size makes them extremely hard to deploy on low-resourced embedded devices. In this paper, we aim to overcome this challenge by developing an efficient end-to-end neural architecture for generating audio from tiny segments of display content on low-resource devices. We introduce a vision transformer-based image encoder and use knowledge distillation to compress the model from 6.1 million to 2.46 million parameters. Human and automatic evaluation results show that our approach leads to a minimal drop in performance and can speed up inference by 22%.
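
To make the compression figures concrete, the following is a minimal sketch in plain Python. The abstract does not state which distillation objective was used, so `distillation_loss`, its `alpha` weighting, and the use of mean squared error against both the ground truth and the teacher output are assumptions chosen for illustration, not the authors' method. Only the two parameter counts (6.1 million and 2.46 million) come from the abstract.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_out, teacher_out, target, alpha=0.5):
    """Hypothetical regression-style distillation objective (an assumption).

    Blends the student's error against the ground-truth target with its
    error against the teacher's output; alpha weights the ground-truth term.
    """
    return alpha * mse(student_out, target) + (1.0 - alpha) * mse(student_out, teacher_out)


def compression_summary(teacher_params, student_params):
    """Return (size ratio, fractional parameter reduction)."""
    ratio = teacher_params / student_params
    reduction = 1.0 - student_params / teacher_params
    return ratio, reduction


# Parameter counts taken from the abstract.
ratio, reduction = compression_summary(6.1e6, 2.46e6)
print(f"{ratio:.2f}x smaller, {reduction:.1%} fewer parameters")
# prints "2.48x smaller, 59.7% fewer parameters"
```

Under these figures the student model keeps roughly 40% of the teacher's parameters.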