A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR
Jian You, Xiangfeng Li

TL;DR
This paper introduces a lightweight CNN-BiLSTM model for real-time punctuation and casing prediction in on-device streaming ASR, achieving high accuracy with minimal size and fast inference.
Contribution
It presents a novel, efficient model for on-device ASR punctuation and casing prediction that outperforms non-Transformer models on F1-score while being far smaller and faster than Transformer-based models.
Findings
9% relative improvement in overall F1-score over the best non-Transformer model on the IWSLT2011 test set
Model is 40 times smaller than Transformer models
Inference is 2.5 times faster than Transformer-based approaches
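To make the terms "relative improvement" and "times smaller" concrete, here is a minimal Python sketch of the arithmetic behind such comparisons. The function names and every number in the usage lines are hypothetical placeholders chosen for illustration; they are not the paper's measured F1-scores or model sizes.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Gain of `proposed` over `baseline`, expressed as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (proposed - baseline) / baseline


def times_smaller(reference: float, proposed: float) -> float:
    """How many times smaller (or faster, for latencies) `proposed` is than `reference`."""
    if proposed <= 0:
        raise ValueError("proposed must be positive")
    return reference / proposed


# Hypothetical values, NOT taken from the paper:
# a baseline F1 of 0.500 and a proposed F1 of 0.545 give a 9% relative gain,
# which is different from a 9-point absolute gain.
gain = relative_improvement(0.500, 0.545)

# Hypothetical sizes in megabytes: 400 MB versus 10 MB is a 40x reduction.
ratio = times_smaller(400.0, 10.0)
```

A relative figure depends on the baseline, so a 9% relative improvement corresponds to a smaller absolute change in F1-score when the baseline itself is well below 1.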
Abstract
Punctuation and word casing prediction are necessary for automatic speech recognition (ASR). With the growing popularity of on-device end-to-end streaming ASR systems, on-device punctuation and word casing prediction has become a necessity, yet we found little discussion of it. Since the emergence of the Transformer, Transformer-based models have been explored for this scenario. However, Transformer-based models are too large for on-device ASR systems. In this paper, we propose a light-weight and efficient model that jointly predicts punctuation and word casing in real time. The model is based on a Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM). Experimental results on the IWSLT2011 test set show that the proposed model obtains a 9% relative improvement in overall F1-score compared to the best non-Transformer model. Compared to the representative of…
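Joint prediction means that each ASR output word receives two labels, one for the punctuation that follows it and one for its casing. The Python sketch below shows only how such per-word labels could be applied to raw ASR text; it does not implement the CNN-BiLSTM model itself. The label names (`O`, `COMMA`, `PERIOD`, `QUESTION`, `LOWER`, `CAP`, `UPPER`) and the function names are assumptions made for illustration and are not confirmed by the abstract.

```python
from typing import List

# Hypothetical label inventory: maps a punctuation label to the mark
# appended after the word. "O" means no punctuation.
PUNCT_MARKS = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}


def apply_casing(word: str, case_label: str) -> str:
    """Apply a (hypothetical) casing label to one lowercase ASR word."""
    if case_label == "UPPER":
        return word.upper()
    if case_label == "CAP":
        return word[:1].upper() + word[1:]
    return word  # "LOWER": keep the ASR output unchanged


def restore(words: List[str], punct: List[str], case: List[str]) -> str:
    """Combine per-word punctuation and casing labels into readable text."""
    if not (len(words) == len(punct) == len(case)):
        raise ValueError("words and both label sequences must have equal length")
    pieces = [apply_casing(w, c) + PUNCT_MARKS[p]
              for w, p, c in zip(words, punct, case)]
    return " ".join(pieces)


# Example labels as a joint model might emit them (hand-written here).
words = ["hello", "how", "are", "you", "i", "am", "fine"]
punct = ["COMMA", "O", "O", "QUESTION", "O", "O", "PERIOD"]
case = ["CAP", "LOWER", "LOWER", "LOWER", "UPPER", "LOWER", "LOWER"]
text = restore(words, punct, case)  # "Hello, how are you? I am fine."
```

In a streaming setting the same per-word application could run incrementally as labels arrive, since each output piece depends only on its own word and its two labels.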
Taxonomy
Topics: Service-Oriented Architecture and Web Services
Methods: Sparse Evolutionary Training · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Adam · Dropout · Multi-Head Attention · Dense Connections
