ASR Error Correction with Constrained Decoding on Operation Prediction
Jingyuan Yang, Rongjun Li, Wei Peng

TL;DR
This paper introduces a constrained decoding method for ASR error correction. It predicts a correction operation for each input token and decodes only the tokens that need to change, which cuts decoding latency without sacrificing accuracy in experiments on three public datasets.
Contribution
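It proposes an operation-prediction-based correction method in which a predictor module placed between the encoder and the decoder labels each token as keep, delete, or change, so that only the tokens marked for change are decoded. This significantly reduces decoding latency while maintaining accuracy. It also releases a benchmark dataset for ASR correction.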
Findings
Inference speed increased by 3.4 and 5.7 times in the reported results.
Word error rate (WER) reduced by up to 1.69%.
Effective on three public datasets.
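For readers unfamiliar with the metric behind the WER finding, the following is a minimal sketch of the standard word error rate: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `wer` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring setup is not described in this summary, and the summary does not say whether the 1.69% figure is an absolute or a relative reduction.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length.

    Tokenization by whitespace is an assumption for this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # reference word missing in hypothesis
                curr[j - 1] + 1,                  # extra word in hypothesis
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word and one extra word against a 5-word reference.
    print(wer("i want to go home", "i want too go go home"))
```

A correction model lowers WER when its output is closer to the reference transcript than the raw ASR hypothesis was.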
Abstract
Error correction techniques remain effective for refining outputs from automatic speech recognition (ASR) models. Existing end-to-end error correction methods based on an encoder-decoder architecture process all tokens in the decoding phase, creating undesirable latency. In this paper, we propose an ASR error correction method utilizing the predictions of correction operations. More specifically, we construct a predictor between the encoder and the decoder to learn if a token should be kept ("K"), deleted ("D"), or changed ("C"), so that decoding is restricted to only part of the input sequence embeddings (the "C" tokens) for fast inference. Experiments on three public datasets demonstrate the effectiveness of the proposed approach in reducing the latency of the decoding process in ASR correction. It enhances the inference speed by at least three times (3.4 and 5.7 times) while maintaining the same…
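The keep/delete/change idea in the abstract can be illustrated with a small sketch of how a corrected sequence might be assembled once operations are known. This is not the authors' implementation: the function `apply_operations`, the grouping of consecutive "C" tokens into a single span, and the `replacements` argument (standing in for decoder output) are all assumptions made for illustration. The point it shows is that the decoder only has to produce text for the "C" spans, while "K" tokens are copied and "D" tokens are dropped.

```python
from typing import List, Sequence


def apply_operations(
    tokens: Sequence[str],
    ops: Sequence[str],
    replacements: Sequence[Sequence[str]],
) -> List[str]:
    """Rebuild a corrected token sequence from per-token operations.

    tokens:       ASR hypothesis tokens.
    ops:          one label per token: "K" (keep), "D" (delete), "C" (change).
    replacements: hypothetical decoder output, one token list per maximal
                  run of consecutive "C" labels, in left-to-right order.
    """
    if len(tokens) != len(ops):
        raise ValueError("tokens and ops must have the same length")

    output: List[str] = []
    span_index = 0
    previous_op = None
    for token, op in zip(tokens, ops):
        if op == "K":
            output.append(token)  # copied straight through, no decoding
        elif op == "C":
            # Emit the replacement once, at the start of each "C" run.
            if previous_op != "C":
                if span_index >= len(replacements):
                    raise ValueError("fewer replacements than 'C' spans")
                output.extend(replacements[span_index])
                span_index += 1
        elif op != "D":  # "D" tokens are simply skipped
            raise ValueError(f"unknown operation: {op!r}")
        previous_op = op

    if span_index != len(replacements):
        raise ValueError("more replacements than 'C' spans")
    return output


if __name__ == "__main__":
    hypothesis = ["i", "want", "too", "go", "go", "home"]
    operations = ["K", "K", "C", "K", "D", "K"]
    decoded_spans = [["to"]]  # stand-in for what a decoder would generate
    print(apply_operations(hypothesis, operations, decoded_spans))
```

In this toy example only one of six tokens requires decoding, which is the intuition behind the latency reduction the abstract reports.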
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
