Adding Connectionist Temporal Summarization into Conformer to Improve Its Decoder Efficiency For Speech Recognition

Nick J.C. Wang; Zongfeng Quan; Shaojun Wang; Jing Xiao

arXiv:2204.03889·cs.SD·April 11, 2022

TL;DR

This paper introduces a connectionist temporal summarization method integrated into Conformer models to enhance speech recognition decoding efficiency, reducing computational load without sacrificing accuracy.

Contribution

The paper proposes a novel connectionist temporal summarization (CTS) technique for Conformer that reduces the number of decoding operations while maintaining, and in some settings improving, ASR accuracy.

Findings

01 Decoding budget reduced by up to 20% on LibriSpeech (beam width 4)

02 Decoding budget reduced by 11% on FluentSpeech (beam width 4)

03 WER reduced by 6% at beam width 1

Abstract

The Conformer model is an excellent architecture for speech recognition that effectively uses the hybrid loss of connectionist temporal classification (CTC) and attention to train model parameters. To improve the decoding efficiency of Conformer, we propose a novel connectionist temporal summarization (CTS) method that reduces the number of encoder-generated acoustic frames fed to the attention decoder, thus reducing operations. To achieve such decoding improvements, however, we must fine-tune the model parameters, because the cross-attention observations change and require corresponding refinement. Our final experiments show that, with a beam width of 4, the decoding budget can be reduced by up to 20% on LibriSpeech and by 11% on FluentSpeech data, without losing ASR accuracy. An improvement in accuracy is even found for…
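The abstract does not say how CTS decides which encoder frames to merge, so the following is only an illustrative sketch of the general idea of shortening the sequence passed to the attention decoder. It assumes, as a guess rather than the authors' stated procedure, that consecutive frames sharing the same greedy CTC label are averaged into one frame. The function name `summarize_frames` and its arguments are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence


def summarize_frames(
    encoder_out: Sequence[Sequence[float]],
    ctc_posteriors: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Average each run of consecutive frames that share a greedy CTC label.

    Assumption (not taken from the paper): a run of identical argmax labels,
    including a run of blanks, is summarized by the mean of its frames.
    """
    if len(encoder_out) != len(ctc_posteriors):
        raise ValueError("encoder_out and ctc_posteriors must have equal length")

    # Greedy CTC label for every frame: argmax over the vocabulary axis.
    labels = [max(range(len(p)), key=p.__getitem__) for p in ctc_posteriors]

    summarized: List[List[float]] = []
    start = 0
    for _, run in groupby(labels):
        length = len(list(run))
        segment = encoder_out[start:start + length]
        dim = len(segment[0])
        # Mean of the run, computed per feature dimension.
        summarized.append(
            [sum(frame[d] for frame in segment) / length for d in range(dim)]
        )
        start += length
    return summarized


# Four frames whose greedy labels are [0, 0, 1, 0] collapse to three frames.
frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]
short = summarize_frames(frames, posteriors)
```

Under this assumption the decoder's cross-attention sees fewer key/value frames, which is consistent with the abstract's remark that the model must be fine-tuned because the cross-attention observations change.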

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing