TST: Time-Sparse Transducer for Automatic Speech Recognition

Xiaohui Zhang; Mangui Liang; Zhengkun Tian; Jiangyan Yi; Jianhua Tao

arXiv:2307.08323 · cs.SD · July 18, 2023

TL;DR

This paper introduces a time-sparse transducer model for speech recognition that reduces memory and computation requirements by decreasing time resolution, maintaining accuracy while significantly improving processing speed.

Contribution

The paper proposes a novel time-sparse mechanism for transducers, enabling faster speech recognition with minimal accuracy loss compared to traditional RNN-T models.

Findings

01. Reduces the real-time factor to 50% of the RNN-T baseline while keeping the character error rate close to that of RNN-T.

02. Adjusting the time resolution reduces the real-time factor further, to 16.54% of the original, with a slight decrease in accuracy.

03. Validated on the Mandarin AISHELL-1 dataset.
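
The speed figures above are stated as fractions of a baseline real-time factor (RTF). As a rough illustration, the sketch below uses the usual definition of RTF, processing time divided by audio duration, and compares two systems. The function names and the timing numbers are hypothetical and are not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the audio decoded."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def relative_rtf(new_rtf: float, baseline_rtf: float) -> float:
    """Express a new RTF as a fraction of a baseline RTF."""
    if baseline_rtf <= 0:
        raise ValueError("baseline_rtf must be positive")
    return new_rtf / baseline_rtf


# Hypothetical timings for one 10-second utterance (illustrative only).
baseline = real_time_factor(processing_seconds=2.0, audio_seconds=10.0)
sparse = real_time_factor(processing_seconds=1.0, audio_seconds=10.0)
ratio = relative_rtf(sparse, baseline)
print(f"baseline RTF={baseline:.2f}, sparse RTF={sparse:.2f}, ratio={ratio:.2%}")
```

A ratio of 50% means the system needs half the decoding time of the baseline for the same audio. A lower RTF is faster.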

Abstract

End-to-end models, especially the Recurrent Neural Network Transducer (RNN-T), have achieved great success in speech recognition. However, a transducer requires a large memory footprint and long computing time when processing a long decoding sequence. To solve this problem, we propose a model named the time-sparse transducer, which introduces a time-sparse mechanism into the transducer. In this mechanism, we obtain intermediate representations by reducing the time resolution of the hidden states. A weighted-average algorithm then combines these representations into sparse hidden states, which are fed to the decoder. All experiments are conducted on the Mandarin dataset AISHELL-1. The character error rate of the time-sparse transducer is close to that of RNN-T, and its real-time factor is 50.00% of the original. By adjusting the time resolution, the time-sparse transducer can also…
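
The abstract describes the mechanism only at a high level, so the following is a minimal sketch of one way it could look, not the authors' implementation. It assumes that "reducing the time resolution" means grouping consecutive frames into fixed-size windows, and that the weighted average uses softmax-normalised per-frame scores. The names `time_sparse`, `scores`, and `window` are hypothetical.

```python
import math
from typing import List, Sequence

Frame = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_sparse(hidden: Sequence[Frame], scores: Sequence[float], window: int) -> List[Frame]:
    """Collapse each run of `window` frames into one weighted-average frame.

    Assumption: the weights are a softmax over per-frame scores inside the window.
    A trailing partial window is averaged over the frames it contains.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    if len(hidden) != len(scores):
        raise ValueError("hidden and scores must have the same length")
    sparse: List[Frame] = []
    for start in range(0, len(hidden), window):
        chunk = hidden[start:start + window]
        weights = softmax(scores[start:start + window])
        dim = len(chunk[0])
        sparse.append(
            [sum(w * frame[d] for w, frame in zip(weights, chunk)) for d in range(dim)]
        )
    return sparse


# Six frames of dimension 2; equal scores reduce to a plain average per window.
hidden_states = [[float(t), 2.0 * t] for t in range(6)]
frame_scores = [0.0] * 6
sparse_states = time_sparse(hidden_states, frame_scores, window=2)
print(len(hidden_states), "->", len(sparse_states), "frames")
```

Under these assumptions, a window of size k shortens the time axis from T to ceil(T / k) frames, so any downstream computation that scales with the number of frames shrinks by roughly the same factor.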

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Handwritten Text Recognition Techniques