CIF-T: A Novel CIF-based Transducer Architecture for Automatic Speech Recognition

Tian-Hao Zhang; Dinghao Zhou; Guiping Zhong; Jiaming Zhou; Baoxiang Li

arXiv:2307.14132·cs.SD·November 28, 2024

Open Access

TL;DR

This paper introduces CIF-T, a speech recognition model that combines the Continuous Integrate-and-Fire (CIF) mechanism with RNN-T. Dropping the RNN-T loss reduces computation and gives the predictor network a larger role, and the model achieves state-of-the-art results on AISHELL-1 and WenetSpeech.

Contribution

The paper presents CIF-T, a novel architecture that replaces the RNN-T loss with CIF-based alignment, which reduces computation and makes fuller use of the predictor network.

Findings

01

CIF-T achieves state-of-the-art accuracy on AISHELL-1 and WenetSpeech datasets.

02

CIF-T reduces computational overhead relative to RNN-T models by abandoning the RNN-T loss.

03

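The proposed enhancements (Funnel-CIF, Context Blocks, a Unified Gating and Bilinear Pooling joint network, and an auxiliary training strategy) further improve speech recognition performance.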

Abstract

RNN-T models are widely used in ASR and rely on the RNN-T loss to align the lengths of the input audio and the target sequence. However, the implementation complexity of the RNN-T loss leads to computational redundancy, and its alignment-based optimization target reduces the role of the predictor network. In this paper, we propose a novel model named CIF-Transducer (CIF-T), which incorporates the Continuous Integrate-and-Fire (CIF) mechanism into the RNN-T model to achieve efficient alignment. The RNN-T loss is thereby abandoned, which reduces computation and gives the predictor network a more significant role. We also introduce Funnel-CIF, Context Blocks, a Unified Gating and Bilinear Pooling joint network, and an auxiliary training strategy to further improve performance. Experiments on the 178-hour AISHELL-1 and 10000-hour WenetSpeech datasets show that…
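For readers unfamiliar with CIF, the following is a minimal sketch of the generic integrate-and-fire step that the abstract builds on, not the paper's Funnel-CIF variant. It assumes each encoder frame comes with a weight in [0, 1] and that a token-level vector is emitted whenever the accumulated weight reaches a threshold of 1.0. The function name `cif`, its parameters, and the toy inputs are hypothetical and chosen only for illustration.

```python
def cif(frames, alphas, threshold=1.0):
    """Generic CIF sketch (illustrative, not the paper's implementation).

    frames: list of equal-length float vectors (encoder outputs).
    alphas: one weight in [0, 1] per frame.
    Returns one integrated vector per firing.
    """
    dim = len(frames[0])
    acc_weight = 0.0           # weight accumulated since the last firing
    acc_state = [0.0] * dim    # weighted sum of frames since the last firing
    fired = []
    for h, a in zip(frames, alphas):
        if acc_weight + a < threshold:
            # Not enough weight yet: keep integrating.
            acc_weight += a
            acc_state = [s + a * x for s, x in zip(acc_state, h)]
        else:
            # Use only the part of this frame's weight needed to reach the threshold.
            used = threshold - acc_weight
            fired.append([s + used * x for s, x in zip(acc_state, h)])
            # The leftover weight starts the next integration window.
            rest = a - used
            acc_weight = rest
            acc_state = [rest * x for x in h]
    return fired


# Toy input: three 1-dimensional frames whose weights sum to 2.0, so two vectors fire.
outputs = cif([[4.0], [8.0], [2.0]], [0.75, 0.75, 0.5])
print(outputs)  # [[5.0], [5.0]]
```

In this sketch the number of emitted vectors tracks the sum of the weights rather than the number of frames, which is how a CIF-style module can bring the acoustic sequence down to roughly token length before it meets the predictor output.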

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing