SoftCorrect: Error Correction with Soft Detection for Automatic Speech Recognition
Yichong Leng, Xu Tan, Wenjie Liu, Kaitao Song, Rui Wang, Xiang-Yang Li, Tao Qin, Edward Lin, Tie-Yan Liu

TL;DR
SoftCorrect introduces a soft error detection mechanism for automatic speech recognition (ASR) error correction: a dedicated language model flags the words that are likely incorrect, so that correction is focused on those words and originally correct words are left untouched.
Contribution
The paper proposes SoftCorrect, an error correction method that combines soft error detection with a constrained CTC loss and is reported to outperform previous approaches in both accuracy and speed.
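The detection step described above can be pictured with a minimal sketch: given a per-token probability that each ASR output token is correct, flag the tokens that fall below a threshold so that only those positions are passed on for correction. The function name `flag_suspect_tokens`, the example probabilities, and the 0.5 threshold are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from typing import List, Tuple


def flag_suspect_tokens(
    tokens: List[str],
    probs: List[float],
    threshold: float = 0.5,  # hypothetical cut-off, not a value from the paper
) -> List[Tuple[str, bool]]:
    """Pair each token with True if its probability is below the threshold."""
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    return [(tok, p < threshold) for tok, p in zip(tokens, probs)]


# Made-up ASR hypothesis and per-token probabilities from an assumed language model.
hyp = ["the", "cat", "sad", "on", "the", "mat"]
probs = [0.98, 0.95, 0.12, 0.91, 0.97, 0.93]

flags = flag_suspect_tokens(hyp, probs)
suspects = [tok for tok, is_suspect in flags if is_suspect]
print(suspects)
```

Only the flagged tokens would be candidates for rewriting; the remaining tokens are copied through unchanged, which is the motivation the summary gives for detecting errors before correcting them.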
Findings
Achieves 26.1% CER reduction on AISHELL-1
Achieves 9.4% CER reduction on Aidatatang
Outperforms previous error correction methods
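To make the metric concrete, the sketch below computes character error rate (CER) as the character-level edit distance divided by the reference length, and then the relative reduction between a baseline and a corrected hypothesis. It assumes the reported figures are relative reductions with respect to the ASR baseline's CER, which this summary does not state explicitly. The strings are made up and the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) over characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, corrected: float) -> float:
    """Fraction of the baseline error rate removed by correction."""
    return (baseline - corrected) / baseline


reference = "abcdefghij"
asr_output = "abxdefyhij"   # two substitution errors
corrected = "abcdefyhij"    # one error fixed, one remaining

baseline_cer = cer(reference, asr_output)
corrected_cer = cer(reference, corrected)
print(f"relative CER reduction: {relative_reduction(baseline_cer, corrected_cer):.1%}")
```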
Abstract
Error correction in automatic speech recognition (ASR) aims to correct the incorrect words in sentences generated by ASR models. Since recent ASR models usually have a low word error rate (WER), error correction models should modify only incorrect words to avoid affecting originally correct tokens, and therefore detecting incorrect words is important for error correction. Previous works on error correction either implicitly detect error words through target-source attention or CTC (connectionist temporal classification) loss, or explicitly locate specific deletion/substitution/insertion errors. However, implicit error detection does not provide a clear signal about which tokens are incorrect, and explicit error detection suffers from low detection accuracy. In this paper, we propose SoftCorrect with a soft error detection mechanism to avoid the limitations of both explicit and implicit…
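For readers unfamiliar with CTC, which the abstract mentions: a CTC model emits one symbol per input position, including a special blank, and the final output is obtained by merging consecutive repeats and then dropping the blanks. The sketch below shows only this standard collapse rule; it is not SoftCorrect's constrained variant, and the blank symbol `_` and the function name are illustrative choices.

```python
from itertools import groupby
from typing import List

BLANK = "_"  # illustrative blank symbol


def ctc_collapse(path: List[str], blank: str = BLANK) -> List[str]:
    """Apply the standard CTC collapse: merge repeats, then remove blanks."""
    merged = [symbol for symbol, _group in groupby(path)]
    return [symbol for symbol in merged if symbol != blank]


# A blank between the two "l" runs keeps them as separate characters.
path = list("hh_e_ll_lo")
print("".join(ctc_collapse(path)))
```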
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Connectionist Temporal Classification Loss
