Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning

Wonjun Lee; San Kim; Gary Geunbae Lee

arXiv:2408.06043·cs.CL·August 13, 2024

Open Access

TL;DR

This paper proposes Context Noise Representation Learning (CNRL), which makes dialogue speech recognition more robust to error-prone dialogue context by combining decoder pre-training with noise modeling. The gains are reported to be largest in noisy real-world environments.

Contribution

The paper introduces CNRL, an approach that improves ASR robustness by modeling the noise present in dialogue context, and pairs it with decoder pre-training for better performance.

Findings

01 Outperforms baseline models in dialogue speech recognition accuracy.

02 Shows significant improvements in noisy environments with low audibility.

03 Demonstrates the effectiveness of noise modeling and decoder pre-training.

Abstract

Recent dialogue systems rely on turn-based spoken interactions, requiring accurate Automatic Speech Recognition (ASR). Errors in ASR can significantly impact downstream dialogue tasks. To address this, using dialogue context from user and agent interactions for transcribing subsequent utterances has been proposed. This method incorporates the transcription of the user's speech and the agent's response as model input, using the accumulated context generated by each turn. However, this context is susceptible to ASR errors because it is generated by the ASR model in an auto-regressive fashion. Such noisy context can further degrade the benefits of context input, resulting in suboptimal ASR performance. In this paper, we introduce Context Noise Representation Learning (CNRL) to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy. To maximize…
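
The context accumulation described in the abstract can be illustrated with a small sketch. This is not the paper's CNRL method or its model; it only shows, under simplifying assumptions, how a context built from the recognizer's own earlier hypotheses carries a recognition error into later turns. The names `transcribe_dialogue`, `toy_recognizer`, and the `Recognizer` signature are hypothetical, and the "audio" is represented by its reference text so the example stays self-contained.

```python
from typing import Callable, List, Tuple

# Hypothetical recognizer signature: (audio, context_text) -> transcript.
Recognizer = Callable[[str, str], str]


def transcribe_dialogue(
    turns: List[Tuple[str, str]],
    recognize: Recognizer,
    sep: str = " ",
) -> List[Tuple[str, str]]:
    """Transcribe each user turn using the accumulated dialogue context.

    `turns` is a list of (user_audio, agent_response) pairs.
    Returns one (context_seen, hypothesis) pair per turn.
    """
    context_parts: List[str] = []
    results: List[Tuple[str, str]] = []
    for user_audio, agent_response in turns:
        context = sep.join(context_parts)
        hypothesis = recognize(user_audio, context)
        results.append((context, hypothesis))
        # The context grows with the recognizer's own output, not the
        # reference transcript, so any error made here is carried forward.
        context_parts.append(hypothesis)
        context_parts.append(agent_response)
    return results


def toy_recognizer(audio: str, context: str) -> str:
    """Stand-in for an ASR model (assumption for illustration only).

    The "audio" is its reference text; one word is deliberately
    misrecognized to simulate an ASR error. The context is ignored.
    """
    return audio.replace("Boston", "Austin")


if __name__ == "__main__":
    dialogue = [
        ("a flight to Boston", "Which date?"),
        ("next Friday", "Booked."),
    ]
    for context, hypothesis in transcribe_dialogue(dialogue, toy_recognizer):
        print(f"context={context!r} -> hypothesis={hypothesis!r}")
```

In this toy run the misrecognized word from the first turn appears in the context supplied for the second turn, which is the kind of noisy context the abstract says CNRL is meant to be robust against.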

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Speech and Audio Processing