Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning
Wonjun Lee, San Kim, Gary Geunbae Lee

TL;DR
This paper proposes Context Noise Representation Learning (CNRL), which uses decoder pre-training and noise modeling to make dialogue speech recognition more robust to noisy context, especially in real-world noisy environments.
Contribution
It introduces a novel CNRL approach that enhances ASR robustness by modeling noise in dialogue context and incorporates decoder pre-training for better performance.
Findings
Outperforms baseline models in dialogue speech recognition accuracy.
Shows significant improvements in noisy environments with low audibility.
Demonstrates the effectiveness of noise modeling and decoder pre-training.
Abstract
Recent dialogue systems rely on turn-based spoken interactions, requiring accurate Automatic Speech Recognition (ASR). Errors in ASR can significantly impact downstream dialogue tasks. To address this, using dialogue context from user and agent interactions for transcribing subsequent utterances has been proposed. This method incorporates the transcription of the user's speech and the agent's response as model input, using the accumulated context generated by each turn. However, this context is susceptible to ASR errors because it is generated by the ASR model in an auto-regressive fashion. Such noisy context can further degrade the benefits of context input, resulting in suboptimal ASR performance. In this paper, we introduce Context Noise Representation Learning (CNRL) to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy. To maximize…
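The context accumulation described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the `recognize` callable, the `transcribe_dialogue` function, and the `max_context_turns` window are hypothetical names introduced only for illustration. The sketch shows why the context is noisy: each turn's context is built from the model's own earlier hypotheses rather than from reference transcripts, so a recognition error in one turn is fed back as input to later turns.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption, not from the paper):
# a recognizer takes an audio reference plus context text and returns a transcript.
Recognizer = Callable[[str, str], str]


def transcribe_dialogue(
    turns: List[Tuple[str, str]],
    recognize: Recognizer,
    max_context_turns: int = 4,
) -> Tuple[List[str], List[str]]:
    """Transcribe user turns one by one, feeding back accumulated context.

    Each element of `turns` is (user_audio, agent_response).
    Returns the per-turn hypotheses and the final accumulated context.
    """
    context: List[str] = []      # alternating user hypothesis / agent response
    hypotheses: List[str] = []

    for user_audio, agent_response in turns:
        # Keep only the most recent turns (two context entries per turn).
        window = context[-2 * max_context_turns:] if max_context_turns > 0 else []
        context_text = " ".join(window)

        hypothesis = recognize(user_audio, context_text)
        hypotheses.append(hypothesis)

        # The hypothesis, not the ground truth, enters the context,
        # so any ASR error here propagates to later turns.
        context.append(hypothesis)
        context.append(agent_response)

    return hypotheses, context
```

With a recognizer that mishears the first turn, the erroneous transcript appears verbatim in the context passed to the second turn, which is the failure mode that CNRL is meant to make the model robust against.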
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Speech and Audio Processing
