PATCorrect: Non-autoregressive Phoneme-augmented Transformer for ASR Error Correction
Ziji Zhang, Zhehui Wang, Rajesh Kamma, Sharanya Eswaran, Narayanan Sadagopan

TL;DR
PATCorrect is a non-autoregressive model that fuses text and phoneme information to efficiently correct ASR errors, significantly reducing word error rate with low latency suitable for real-time applications.
Contribution
It introduces a novel multi-modal fusion approach in a non-autoregressive framework for ASR error correction, outperforming existing text-only methods.
Findings
Achieves an overall 11.62% WER reduction (WERR) on English ASR outputs, compared with 9.46% for text-only methods.
Operates with inference latency in the tens of milliseconds.
Outperforms state-of-the-art NAR correction methods.
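The WER and WERR figures above can be made concrete with a small worked example. The sketch below is not taken from the paper. It assumes WER is the word-level edit distance divided by the reference length, and that WERR is the relative reduction in WER after correction, which is the usual convention but is not spelled out in this summary. The function names and example sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the number of reference words."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def werr(wer_before, wer_after):
    """Relative WER reduction (assumed definition of WERR)."""
    if wer_before == 0:
        raise ValueError("WERR is undefined when the baseline WER is zero")
    return (wer_before - wer_after) / wer_before


# Hypothetical sentences, not data from the paper.
reference = "please turn on the kitchen lights now"
asr_output = "please turn on the chicken light now"   # 2 of 7 words wrong
corrected = "please turn on the kitchen light now"    # 1 of 7 words wrong

before = wer(reference, asr_output)
after = wer(reference, corrected)
print(f"WER before: {before:.1%}, after: {after:.1%}, WERR: {werr(before, after):.1%}")
```

In this toy case the correction step fixes one of the two wrong words, so WER drops from 2/7 to 1/7 and the WERR is 50%. Reported figures such as 11.62% are normally computed over a whole test corpus, not per sentence.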
Abstract
Speech-to-text errors made by automatic speech recognition (ASR) systems negatively impact downstream models. Error correction models have recently been developed as a post-processing text editing method for refining ASR outputs. However, efficient models that meet the low-latency requirements of industrial-grade production systems have not been well studied. We propose PATCorrect, a novel non-autoregressive (NAR) approach based on multi-modal fusion that leverages representations from both text and phoneme modalities to reduce word error rate (WER) and perform robustly with varying input transcription quality. We demonstrate that PATCorrect consistently outperforms the state-of-the-art NAR method on an English corpus across different upstream ASR systems, with an overall 11.62% WER reduction (WERR) compared to the 9.46% WERR achieved by other methods using only the text modality. Besides, its…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Layer Normalization · Dense Connections · Position-Wise Feed-Forward Layer · Adam · Label Smoothing
