CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition

Linhao Dong; Bo Xu

arXiv:1905.11235·cs.CL·February 13, 2020·6 cites

TL;DR

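This paper introduces CIF, a soft, monotonic alignment mechanism inspired by the integrate-and-fire model in spiking neural networks. It supports online speech recognition and acoustic boundary positioning, reaches 2.86% WER on Librispeech test-clean, and sets a new state-of-the-art result on a Mandarin telephone ASR benchmark.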

Contribution

The paper proposes CIF, a new continuous, monotonic alignment method for end-to-end speech recognition, supporting online processing and acoustic boundary detection.

Findings

01

Achieves 2.86% WER on Librispeech test-clean

02

Supports online recognition and acoustic boundary positioning

03

Sets a new state-of-the-art result on a Mandarin telephone ASR benchmark

Abstract

In this paper, we propose a novel soft and monotonic alignment mechanism for sequence transduction. It is inspired by the integrate-and-fire model in spiking neural networks and is employed in the encoder-decoder framework. Because it consists of continuous functions, it is named Continuous Integrate-and-Fire (CIF). Applied to the ASR task, CIF not only has a concise calculation but also supports online recognition and acoustic boundary positioning, making it suitable for various ASR scenarios. Several support strategies are also proposed to alleviate problems unique to the CIF-based model. With these methods combined, the CIF-based model shows competitive performance. Notably, it achieves a word error rate (WER) of 2.86% on the test-clean set of Librispeech and sets a new state-of-the-art result on a Mandarin telephone ASR benchmark.
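The abstract does not spell out the computation, so the following is only a minimal sketch of an integrate-and-fire style alignment, not the authors' implementation. It assumes each encoder frame comes with a scalar weight in [0, 1], that weights are accumulated left to right, and that an integrated embedding is emitted ("fired") whenever the accumulated weight reaches a threshold of 1.0, with the boundary frame's weight split between the current and the next output. The function name `cif`, its parameters, and the threshold value are illustrative assumptions.

```python
from typing import List, Tuple


def cif(
    hidden: List[List[float]],
    alphas: List[float],
    threshold: float = 1.0,  # assumed firing threshold
) -> Tuple[List[List[float]], List[int]]:
    """Integrate weighted frames and fire an embedding at each threshold crossing.

    Assumes every weight lies in [0, threshold], so a single frame can
    trigger at most one firing. Returns the fired embeddings and the
    frame index at which each one fired (a rough acoustic boundary).
    """
    dim = len(hidden[0])
    fired: List[List[float]] = []
    boundaries: List[int] = []
    acc_weight = 0.0
    acc_state = [0.0] * dim

    for t, (h, a) in enumerate(zip(hidden, alphas)):
        if acc_weight + a < threshold:
            # Integrate: keep accumulating the weighted frame.
            acc_weight += a
            acc_state = [s + a * x for s, x in zip(acc_state, h)]
        else:
            # Fire: use just enough of this frame's weight to reach the threshold.
            used = threshold - acc_weight
            fired.append([s + used * x for s, x in zip(acc_state, h)])
            boundaries.append(t)
            # The leftover weight starts the integration of the next output.
            rest = a - used
            acc_weight = rest
            acc_state = [rest * x for x in h]

    return fired, boundaries


if __name__ == "__main__":
    frames = [[1.0], [2.0], [4.0]]
    weights = [0.75, 0.5, 0.75]
    print(cif(frames, weights))
```

Because each output depends only on frames seen so far, a loop of this shape can run frame by frame, which is consistent with the online recognition and boundary positioning the abstract mentions. Any weight left over after the last frame is simply dropped in this sketch.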

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing