Breaking Through the Spike: Spike Window Decoding for Accelerated and Precise Automatic Speech Recognition

Wei Zhang; Tian-Hao Zhang; Chao Luo; Hui Zhou; Chao Yang; Xinyuan Qian; Xu-Cheng Yin

arXiv:2501.03257 · eess.AS · January 8, 2025

TL;DR

This paper introduces Spike Window Decoding, a novel method that accelerates end-to-end speech recognition by leveraging the spike property of CTC outputs, achieving state-of-the-art accuracy with faster inference.

Contribution

The paper proposes the Spike Window Decoding algorithm, which speeds up WFST-based speech recognition by focusing on non-blank CTC spike frames and their adjacent frames while maintaining high accuracy.

Findings

01. Achieves state-of-the-art recognition accuracy.

02. Significantly accelerates decoding speed.

03. Effective across multiple datasets.

Abstract

Recently, end-to-end automatic speech recognition has become the mainstream approach in both industry and academia. To optimize system performance in specific scenarios, the Weighted Finite-State Transducer (WFST) is extensively used to integrate acoustic and language models, leveraging its capacity to implicitly fuse language models within static graphs, thereby ensuring robust recognition while also facilitating rapid error correction. However, WFST necessitates a frame-by-frame search of CTC posterior probabilities through autoregression, which significantly hampers inference speed. In this work, we thoroughly investigate the spike property of CTC outputs and further propose the conjecture that adjacent frames to non-blank spikes carry semantic information beneficial to the model. Building on this, we propose the Spike Window Decoding algorithm, which greatly improves the inference…
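The abstract is truncated before it describes the algorithm, so the following is only a minimal sketch of the general idea it states: keep the non-blank spike frames of a CTC output together with their adjacent frames, so that a downstream search sees fewer frames. The function name `spike_window_frames`, the argmax-based spike test, the blank index 0, and the one-frame window on each side are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def spike_window_frames(
    posteriors: Sequence[Sequence[float]],
    blank_id: int = 0,   # assumption: index of the CTC blank label
    left: int = 1,       # assumption: frames kept before each spike
    right: int = 1,      # assumption: frames kept after each spike
) -> List[int]:
    """Return sorted indices of frames inside a window around each non-blank spike.

    A frame counts as a non-blank spike when its highest-scoring label
    is not the blank label (an illustrative criterion, not the paper's).
    """
    num_frames = len(posteriors)
    keep = set()
    for t, frame in enumerate(posteriors):
        best_label = max(range(len(frame)), key=lambda k: frame[k])
        if best_label != blank_id:
            start = max(0, t - left)                # clamp at first frame
            end = min(num_frames - 1, t + right)    # clamp at last frame
            keep.update(range(start, end + 1))
    return sorted(keep)


# Toy CTC posteriors: 10 frames, 3 labels (label 0 is blank).
posteriors = [
    [0.98, 0.01, 0.01],
    [0.97, 0.02, 0.01],
    [0.90, 0.05, 0.05],
    [0.10, 0.85, 0.05],    # non-blank spike (label 1) at frame 3
    [0.95, 0.03, 0.02],
    [0.99, 0.005, 0.005],
    [0.96, 0.02, 0.02],
    [0.05, 0.05, 0.90],    # non-blank spike (label 2) at frame 7
    [0.97, 0.02, 0.01],
    [0.99, 0.005, 0.005],
]

selected = spike_window_frames(posteriors)
print(selected)  # [2, 3, 4, 6, 7, 8]
```

In this toy input, 6 of the 10 frames survive the selection. Only the selected frames would then be passed to the frame-by-frame WFST search, which is where a speed-up could come from; how the paper actually defines the window and feeds it to the decoder is not recoverable from the text shown here.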

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Weighted Finite-State Transducer