Compression of end-to-end non-autoregressive image-to-speech system for low-resourced devices
Gokul Srinivasagan, Michael Deisher, Munir Georges

TL;DR
This paper presents a compressed end-to-end image-to-speech system using vision transformers and knowledge distillation, enabling efficient deployment on low-resource devices with minimal performance loss.
Contribution
The paper introduces a vision transformer-based image encoder and applies knowledge distillation to reduce the model from 6.1 million to 2.46 million parameters, targeting deployment on embedded devices.
Findings
Model size reduced from 6.1M to 2.46M parameters
Inference speed increased by 22%
Minimal performance drop in human and automatic evaluations
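As a quick arithmetic check on the reported sizes, the sketch below computes the relative reduction implied by the two parameter counts. The variable names are illustrative and are not taken from the paper.

```python
# Parameter counts as reported in the findings, in millions.
original_params_m = 6.1
compressed_params_m = 2.46

# Fraction of parameters removed by compression.
reduction = 1.0 - compressed_params_m / original_params_m

# How many times smaller the compressed model is.
compression_ratio = original_params_m / compressed_params_m

print(f"Parameters removed: {reduction:.1%}")
print(f"Compression ratio: {compression_ratio:.2f}x")
```

On these numbers the compressed model keeps roughly 40% of the original parameters, that is, about 2.5 times fewer.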
Abstract
People with visual impairments have difficulty accessing touchscreen-enabled personal computing devices such as mobile phones and laptops. Image-to-speech (ITS) systems can help mitigate this problem, but their large model size makes them extremely hard to deploy on low-resource embedded devices. In this paper, we aim to overcome this challenge by developing an efficient end-to-end neural architecture for generating audio from tiny segments of display content on low-resource devices. We introduce a vision transformer-based image encoder and use knowledge distillation to compress the model from 6.1 million to 2.46 million parameters. Human and automatic evaluation results show that our approach leads to a minimal drop in performance and can speed up inference by 22%.
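The abstract does not say which distillation objective was used, so the following is only a minimal sketch of the general idea: a small student model is trained to match both the ground-truth target and the output of a larger teacher. It assumes a regression-style output (for example, spectrogram frame values) and an L1 loss; the function names, the `alpha` weight, and the sample values are all hypothetical.

```python
from typing import Sequence


def l1_loss(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Mean absolute error between two equal-length sequences."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


def distillation_loss(
    student_out: Sequence[float],
    teacher_out: Sequence[float],
    target: Sequence[float],
    alpha: float = 0.5,  # hypothetical weight on the teacher-matching term
) -> float:
    """Blend the ground-truth loss with a teacher-matching loss.

    Assumption: a generic output-level distillation objective,
    not necessarily the one used in the paper.
    """
    hard = l1_loss(student_out, target)       # fit the ground truth
    soft = l1_loss(student_out, teacher_out)  # imitate the teacher
    return (1.0 - alpha) * hard + alpha * soft


# Toy values standing in for a few output frame values.
student = [0.2, 0.4]
teacher = [0.25, 0.5]
target = [0.3, 0.5]

loss = distillation_loss(student, teacher, target, alpha=0.5)
print(f"distillation loss: {loss:.4f}")
```

Setting `alpha` to 0 recovers ordinary supervised training, while larger values lean more heavily on the teacher's predictions.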
Taxonomy
Topics: Speech and Audio Processing · Image Enhancement Techniques · Video Analysis and Summarization
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Knowledge Distillation
