Extremely Low Footprint End-to-End ASR System for Smart Device
Zhifu Gao, Yiwu Yao, Shiliang Zhang, Jun Yang, Ming Lei, Ian McLoughlin

TL;DR
This paper presents a highly efficient end-to-end speech recognition system optimized for smart devices, combining model compression and weight sharing to drastically reduce size with minimal accuracy loss.
Contribution
It introduces a novel low-footprint E2E ASR model using cross-layer weight sharing and compression techniques, enabling deployment on resource-constrained devices.
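To illustrate why cross-layer weight sharing shrinks a model, here is a minimal sketch in plain Python. It is not the paper's architecture: the class `SharedLinearStack`, the helper `unshared_parameters`, the toy feed-forward layer, and the choice to share one weight matrix across all layers are assumptions made for illustration only. The point is that the stored parameter count stays constant as depth grows, while an unshared stack grows linearly with the number of layers.

```python
import random


class SharedLinearStack:
    """Toy stack (hypothetical) that applies the SAME weights at every layer."""

    def __init__(self, dim, num_layers, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.num_layers = num_layers
        # One weight matrix and one bias vector, reused by every layer.
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)
        ]
        self.bias = [0.0] * dim

    def num_parameters(self):
        # Stored parameters do not depend on num_layers.
        return self.dim * self.dim + self.dim

    def forward(self, x):
        for _ in range(self.num_layers):
            # Linear transform with the shared weights.
            y = [
                sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)
            ]
            # Residual connection around a ReLU.
            x = [xi + max(0.0, yi) for xi, yi in zip(x, y)]
        return x


def unshared_parameters(dim, num_layers):
    """Parameter count if every layer kept its own weights and bias."""
    return num_layers * (dim * dim + dim)


stack = SharedLinearStack(dim=8, num_layers=6)
print("shared:", stack.num_parameters())
print("unshared:", unshared_parameters(8, 6))
```

In this toy setting the saving equals the number of layers that share weights; a real system would trade some of that saving back to recover accuracy.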
Findings
Achieves more than a 10x reduction in model size on AISHELL-2
Maintains near-original accuracy, with only a 0.43% increase in character error rate (CER)
Demonstrates that deployment on smart devices is feasible
Abstract
Recently, end-to-end (E2E) speech recognition has become popular, since it can integrate the acoustic, pronunciation and language models into a single neural network, which outperforms conventional models. Among E2E approaches, attention-based models, e.g. Transformer, have emerged as being superior. Such models have opened the door to deployment of ASR on smart devices; however, they still require a large number of model parameters. We propose an extremely low footprint E2E ASR system for smart devices, with the goal of satisfying resource constraints without sacrificing recognition accuracy. We design cross-layer weight sharing to improve parameter efficiency, and further exploit model compression methods, including sparsification and quantization, to reduce memory storage and boost decoding efficiency. We evaluate our approaches on the public AISHELL-1 and AISHELL-2…
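The abstract names sparsification and quantization without saying which variants are used. As a rough sketch under that caveat, the snippet below assumes magnitude-based pruning and symmetric per-tensor int8 quantization, two common choices that may differ from the paper's methods. The function names `magnitude_prune`, `quantize_int8`, and `dequantize` are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(len(weights) * sparsity)  # number of weights to drop
    if k == 0:
        return list(weights)
    # Indices of the k weights with the smallest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.5, -0.02, 0.3, 0.01, -1.0, 0.07]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)
print(pruned)
print(q, scale)
print(restored)
```

Pruning makes the weight tensor sparse so zeros need not be stored or multiplied, and int8 storage uses a quarter of the bytes of 32-bit floats; the rounding error per surviving weight is at most half of `scale`.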
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
