TST: Time-Sparse Transducer for Automatic Speech Recognition
Xiaohui Zhang, Mangui Liang, Zhengkun Tian, Jiangyan Yi, Jianhua Tao

TL;DR
This paper introduces a time-sparse transducer model for speech recognition that reduces memory and computation requirements by decreasing time resolution, maintaining accuracy while significantly improving processing speed.
Contribution
The paper proposes a novel time-sparse mechanism for transducers, enabling faster speech recognition with minimal accuracy loss compared to traditional RNN-T models.
Findings
Reduces the real-time factor to 50% of RNN-T's while keeping a comparable character error rate.
Adjusting the time resolution lowers the real-time factor further, to 16.54% of the original, at the cost of a slight accuracy decrease.
Validated on Mandarin AISHELL-1 dataset.
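To put the reported ratios in more familiar terms: the real-time factor (RTF) is processing time divided by audio duration, so a relative RTF can be inverted to give a speed-up multiple. The sketch below only performs that arithmetic on the two figures quoted above; the function name is illustrative and does not come from the paper.

```python
def speedup_from_relative_rtf(relative_rtf: float) -> float:
    """Convert 'RTF as a fraction of the baseline RTF' into a speed-up multiple.

    RTF = processing_time / audio_duration, so for the same audio a model
    whose RTF is `relative_rtf` times the baseline is 1 / relative_rtf
    times faster. The function name is hypothetical.
    """
    if relative_rtf <= 0:
        raise ValueError("relative_rtf must be positive")
    return 1.0 / relative_rtf


# The two ratios reported in the findings above.
for ratio in (0.50, 0.1654):
    print(f"relative RTF {ratio:.4f} -> about {speedup_from_relative_rtf(ratio):.2f}x faster")
```

Read this way, the two settings correspond to decoding roughly 2 times and roughly 6 times faster than the RNN-T baseline.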
Abstract
End-to-end models, especially the Recurrent Neural Network Transducer (RNN-T), have achieved great success in speech recognition. However, a transducer requires a large memory footprint and long computing time when processing a long decoding sequence. To solve this problem, we propose a model named the time-sparse transducer, which introduces a time-sparse mechanism into the transducer. In this mechanism, we obtain the intermediate representations by reducing the time resolution of the hidden states. A weighted average algorithm is then used to combine these representations into sparse hidden states followed by the decoder. All experiments are conducted on the Mandarin dataset AISHELL-1. The character error rate of the time-sparse transducer is close to that of RNN-T, and the real-time factor is 50.00% of the original. By adjusting the time resolution, the time-sparse transducer can also…
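The abstract does not spell out how the time resolution is reduced or how the averaging weights are chosen, so the following is only a minimal sketch of the general idea, under loudly stated assumptions: frames are grouped into fixed, non-overlapping windows, and each window is collapsed into one vector by a weighted average with user-supplied weights. The function `time_sparse` and its parameters `window` and `weights` are hypothetical names, not the paper's.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def time_sparse(
    hidden: Sequence[Sequence[float]],
    window: int,
    weights: Optional[Sequence[float]] = None,
) -> List[Vector]:
    """Collapse T hidden-state frames into ceil(T / window) sparse frames.

    ASSUMPTION: fixed non-overlapping windows and externally supplied
    weights. The paper's actual windowing and weighting may differ.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    if weights is None:
        weights = [1.0] * window  # uniform weights give a plain mean
    if len(weights) != window:
        raise ValueError("need exactly one weight per frame in a window")

    sparse: List[Vector] = []
    for start in range(0, len(hidden), window):
        chunk = hidden[start:start + window]
        w = weights[:len(chunk)]  # the last window may be shorter
        total = sum(w)
        dim = len(chunk[0])
        # Weighted average of the frames in this window, per dimension.
        sparse.append(
            [sum(wi * frame[d] for wi, frame in zip(w, chunk)) / total
             for d in range(dim)]
        )
    return sparse


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(time_sparse(frames, window=2))
```

Because a transducer's joint network scores every pair of time frame and label position, shortening the time axis by a factor of `window` shrinks that grid by roughly the same factor, which is where savings in memory and computation of the kind described above would come from.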
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Handwritten Text Recognition Techniques
