PATCorrect: Non-autoregressive Phoneme-augmented Transformer for ASR Error Correction

Ziji Zhang; Zhehui Wang; Rajesh Kamma; Sharanya Eswaran; Narayanan Sadagopan

arXiv:2302.05040 · cs.CL · June 22, 2023

Open Access

TL;DR

PATCorrect is a non-autoregressive model that fuses text and phoneme information to efficiently correct ASR errors, significantly reducing word error rate with low latency suitable for real-time applications.

Contribution

The paper introduces a multi-modal fusion approach, combining text and phoneme representations within a non-autoregressive framework for ASR error correction, that outperforms existing text-only methods.

Findings

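01. Achieves an overall 11.62% WER reduction (WERR) on English ASR outputs, compared with 9.46% for text-only methods.

02. Operates with inference latency in the tens of milliseconds.

03. Outperforms the state-of-the-art NAR correction method across different upstream ASR systems.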

Abstract

Speech-to-text errors made by automatic speech recognition (ASR) systems negatively impact downstream models. Error correction models, applied as a post-processing text editing step, have recently been developed to refine ASR outputs. However, efficient models that meet the low-latency requirements of industrial-grade production systems have not been well studied. We propose PATCorrect, a novel non-autoregressive (NAR) approach based on multi-modal fusion that leverages representations from both the text and phoneme modalities to reduce word error rate (WER) and perform robustly under varying input transcription quality. We demonstrate that PATCorrect consistently outperforms the state-of-the-art NAR method on an English corpus across different upstream ASR systems, with an overall 11.62% WER reduction (WERR) compared to 9.46% WERR achieved by other methods using the text-only modality. Besides, its…
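For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how WER and WERR are commonly computed. It assumes WER is the word-level edit distance divided by the reference length and that WERR is the relative reduction in WER after correction; the listing does not spell out the paper's exact evaluation script, so the function names and the example sentences are hypothetical illustrations, not the authors' code or data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(wer_before: float, wer_after: float) -> float:
    """Relative WER reduction (WERR), assumed to be (before - after) / before."""
    if wer_before <= 0:
        raise ValueError("wer_before must be positive")
    return (wer_before - wer_after) / wer_before


# Hypothetical example sentences, not taken from the paper.
reference = "please turn on the kitchen lights"
asr_output = "please turn on the kitten light"
corrected = "please turn on the kitchen light"

wer_asr = word_error_rate(reference, asr_output)
wer_corrected = word_error_rate(reference, corrected)
print(f"WER before correction: {wer_asr:.4f}")
print(f"WER after correction:  {wer_corrected:.4f}")
print(f"WERR: {wer_reduction(wer_asr, wer_corrected):.2%}")
```

Under this definition, a WERR of 11.62% means the corrected transcripts have a WER that is 11.62% lower, in relative terms, than the raw ASR output.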

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Layer Normalization · Dense Connections · Position-Wise Feed-Forward Layer · Adam · Label Smoothing