Compression of end-to-end non-autoregressive image-to-speech system for low-resourced devices

Gokul Srinivasagan; Michael Deisher; Munir Georges

arXiv:2312.00174 · eess.AS · December 4, 2023 · 1 citation

Open Access

TL;DR

This paper presents a compressed end-to-end image-to-speech system using vision transformers and knowledge distillation, enabling efficient deployment on low-resource devices with minimal performance loss.

Contribution

The paper introduces a vision transformer-based image encoder and applies knowledge distillation to shrink the model from 6.1 million to 2.46 million parameters, making it small enough for embedded devices.

Findings

01 Model size reduced from 6.1M to 2.46M parameters

02 Inference speed increased by 22%

03 Minimal performance drop in human and automatic evaluations
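
To put the first finding in perspective, the parameter counts above can be turned into a compression ratio and a relative reduction. The snippet below is a minimal sketch of that arithmetic. The function name `compression_stats` is hypothetical, and the memory estimate assumes 32-bit floating-point weights, which the listing does not state.

```python
# Parameter counts quoted in the findings above.
BASELINE_PARAMS = 6.1e6
COMPRESSED_PARAMS = 2.46e6

# Assumption: weights stored as 32-bit floats (not stated in the source).
BYTES_PER_PARAM = 4


def compression_stats(before, after):
    """Return (compression ratio, fractional parameter reduction)."""
    if before <= 0 or after <= 0:
        raise ValueError("parameter counts must be positive")
    ratio = before / after
    reduction = 1.0 - after / before
    return ratio, reduction


ratio, reduction = compression_stats(BASELINE_PARAMS, COMPRESSED_PARAMS)
print(f"compression ratio: {ratio:.2f}x")
print(f"parameter reduction: {reduction:.1%}")

# Rough weight-storage footprint under the float32 assumption.
for name, count in (("baseline", BASELINE_PARAMS), ("compressed", COMPRESSED_PARAMS)):
    print(f"{name}: {count * BYTES_PER_PARAM / 1e6:.1f} MB")
```

With the quoted figures this works out to roughly a 2.48x compression ratio, or about a 59.7% reduction in parameters.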

Abstract

People with visual impairments have difficulty accessing touchscreen-enabled personal computing devices such as mobile phones and laptops. Image-to-speech (ITS) systems can help mitigate this problem, but their large model size makes them extremely hard to deploy on low-resourced embedded devices. In this paper, we aim to overcome this challenge by developing an efficient end-to-end neural architecture for generating audio from tiny segments of display content on low-resource devices. We introduce a vision transformer-based image encoder and use knowledge distillation to compress the model from 6.1 million to 2.46 million parameters. Human and automatic evaluation results show that our approach leads to only a minimal drop in performance and speeds up inference by 22%.
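
The abstract does not say which distillation objective is used. As a generic illustration only, and not the authors' method, the sketch below shows one common form of knowledge distillation for regression-style outputs such as spectrogram frames: the student is trained on a weighted sum of its error against the ground truth and its error against the teacher's output. The names `mse`, `distillation_loss`, and `alpha` are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_out, teacher_out, target, alpha=0.5):
    """Blend a ground-truth loss with a teacher-matching loss.

    alpha weights the ground-truth term; (1 - alpha) weights the
    teacher-matching term. This weighting scheme is an assumption
    made for illustration, not taken from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    hard_loss = mse(student_out, target)       # student vs. ground truth
    soft_loss = mse(student_out, teacher_out)  # student vs. teacher
    return alpha * hard_loss + (1.0 - alpha) * soft_loss


# Toy values standing in for one frame of predicted acoustic features.
student = [0.2, 0.4, 0.1]
teacher = [0.25, 0.35, 0.15]
target = [0.3, 0.3, 0.2]
print(distillation_loss(student, teacher, target, alpha=0.7))
```

Setting `alpha=1.0` recovers ordinary supervised training, while lower values make the smaller student lean more heavily on the larger teacher's predictions.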

Taxonomy

Topics: Speech and Audio Processing · Image Enhancement Techniques · Video Analysis and Summarization

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Knowledge Distillation