IKFST: IOO and KOO Algorithms for Accelerated and Precise WFST-based End-to-End Automatic Speech Recognition

Zhuoran Zhuang; Ye Chen; Chao Luo; Tian-Hao Zhang; Xuewei Zhang; Jian Ma; Jiatong Shi; Wei Zhang

arXiv:2601.00160·cs.SD·January 5, 2026

Open Access

TL;DR

This paper introduces IKFST algorithms that leverage the structural roles of blank and non-blank frames in CTC outputs to significantly accelerate WFST-based end-to-end speech recognition without sacrificing accuracy.

Contribution

The paper proposes two decoding algorithms, Keep-Only-One (KOO) and Insert-Only-One (IOO), that speed up inference in WFST-based speech recognition by exploiting the distinct roles of blank and non-blank frames in CTC outputs.

Findings

01 Achieved state-of-the-art accuracy on multiple datasets.

02 Reduced decoding latency substantially.

03 Demonstrated compatibility with large-scale speech systems.

Abstract

End-to-end automatic speech recognition has become the dominant paradigm in both academia and industry. To enhance recognition performance, the Weighted Finite-State Transducer (WFST) is widely adopted to integrate acoustic and language models through static graph composition, providing robust decoding and effective error correction. However, WFST decoding relies on a frame-by-frame autoregressive search over CTC posterior probabilities, which severely limits inference efficiency. Motivated by establishing a more principled compatibility between WFST decoding and CTC modeling, we systematically study the two fundamental components of CTC outputs, namely blank and non-blank frames, and identify a key insight: blank frames primarily encode positional information, while non-blank frames carry semantic content. Building on this observation, we introduce Keep-Only-One and Insert-Only-One,…
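The abstract is cut off before it describes Keep-Only-One and Insert-Only-One, so the sketch below is not either of those algorithms. It only illustrates the distinction the abstract draws between blank and non-blank frames, using standard CTC greedy decoding on a toy posterior matrix. The blank index of 0, the helper names `split_frames` and `ctc_greedy_collapse`, and the toy values are all assumptions made for illustration.

```python
from typing import List, Tuple

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def argmax(frame: List[float]) -> int:
    """Index of the largest posterior in one frame."""
    return max(range(len(frame)), key=frame.__getitem__)


def split_frames(posteriors: List[List[float]], blank: int = BLANK) -> Tuple[List[int], List[int]]:
    """Partition frame indices into (blank, non-blank) by per-frame argmax."""
    blank_idx: List[int] = []
    non_blank_idx: List[int] = []
    for t, frame in enumerate(posteriors):
        if argmax(frame) == blank:
            blank_idx.append(t)
        else:
            non_blank_idx.append(t)
    return blank_idx, non_blank_idx


def ctc_greedy_collapse(posteriors: List[List[float]], blank: int = BLANK) -> List[int]:
    """Standard CTC greedy decoding: argmax per frame, merge repeats, drop blanks."""
    labels: List[int] = []
    prev = None
    for frame in posteriors:
        best = argmax(frame)
        if best != prev and best != blank:
            labels.append(best)
        prev = best
    return labels


# Toy posteriors over {blank=0, 'a'=1, 'b'=2} for six frames (made-up values).
toy = [
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.80, 0.10],  # 'a'
    [0.10, 0.80, 0.10],  # 'a' repeated
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.10, 0.70],  # 'b'
    [0.95, 0.03, 0.02],  # blank
]

blank_frames, non_blank_frames = split_frames(toy)
labels = ctc_greedy_collapse(toy)
print(blank_frames, non_blank_frames, labels)
```

In this toy case the label sequence is determined entirely by the non-blank frames, while the blank frames only mark where labels are separated in time. A frame-by-frame WFST search would still visit all six frames.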

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing