Crossmodal ASR Error Correction with Discrete Speech Units

Yuanchao Li; Pinzhen Chen; Peter Bell; Catherine Lai

arXiv:2405.16677 · eess.AS · September 16, 2024

TL;DR

This paper introduces a crossmodal ASR error correction (AEC) method that uses discrete speech units to improve transcript accuracy in low-resource, out-of-domain scenarios and to benefit downstream tasks such as speech emotion recognition.

Contribution

The paper incorporates discrete speech units into AEC to align with and enhance word embeddings, and it explores pre-training and fine-tuning strategies that account for ASR domain discrepancy on low-resource, out-of-domain data.

Findings

01 Effective correction of ASR errors in low-resource settings

02 Improved downstream speech emotion recognition performance

03 Demonstrated generalizability across multiple corpora

Abstract

ASR remains unsatisfactory in scenarios where the speaking style diverges from that used to train ASR systems, resulting in erroneous transcripts. To address this, ASR Error Correction (AEC), a post-ASR processing approach, is required. In this work, we tackle an understudied issue: the Low-Resource Out-of-Domain (LROOD) problem, by investigating crossmodal AEC on very limited downstream data with 1-best hypothesis transcription. We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon, shedding light on appropriate training schemes for LROOD data. Moreover, we propose the incorporation of discrete speech units to align with and enhance the word embeddings for improving AEC quality. Results from multiple corpora and several evaluation metrics demonstrate the feasibility and efficacy of our proposed AEC approach on LROOD data as well as its…

Code & Models

Repositories

yc-li20/Crossmodal_AEC (PyTorch, official)

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis

Methods: ALIGN