Can Visual Context Improve Automatic Speech Recognition for an Embodied Agent?
Pradip Pramanick, Chayan Sarkar

TL;DR
This paper introduces a novel method to enhance automatic speech recognition in robots by integrating visual context, significantly reducing word error rate and improving recognition of entity descriptions in noisy or accented speech.
Contribution
The paper proposes a new decoder biasing technique that incorporates visual information into ASR, improving accuracy without degrading performance in incorrect contexts.
Findings
Achieved a 59% relative reduction in word error rate (WER)
Improved recognition of entity descriptions in noisy conditions
Enhanced ASR performance with visual context integration
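To make the "relative reduction in WER" figure concrete, here is a minimal sketch of how word error rate and a relative reduction are commonly computed. It is not the authors' evaluation code. The function names, the example sentences, and the 0.20 and 0.082 WER values are hypothetical and chosen only for illustration. They are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline WER that the improved system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


# Hypothetical transcripts of a spoken robot instruction (illustration only).
reference = "pick up the red mug near the laptop"
baseline = word_error_rate(reference, "pick up the red mud near the lab top")
improved = word_error_rate(reference, "pick up the red mug near the lab top")
print(baseline, improved, relative_wer_reduction(baseline, improved))

# Hypothetical WER values that happen to give a 59% relative reduction.
print(round(relative_wer_reduction(0.20, 0.082), 2))
```

A relative reduction is measured against the baseline error rather than in absolute percentage points, so the same 59% could describe a drop from 20% to 8.2% WER or from 10% to 4.1% WER.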
Abstract
The usage of automatic speech recognition (ASR) systems is becoming omnipresent, ranging from personal assistants to chatbots, home, and industrial automation systems. Modern robots are also equipped with ASR capabilities for interacting with humans, as speech is the most natural interaction modality. However, ASR in robots faces additional challenges compared to a personal assistant. Being an embodied agent, a robot must recognize the physical entities around it and therefore reliably recognize speech containing descriptions of such entities. However, current ASR systems are often unable to do so due to limitations in ASR training, such as generic datasets and open-vocabulary modeling. Also, adverse conditions during inference, such as noise, accented speech, and far-field speech, make the transcription inaccurate. In this work, we present a method to incorporate a robot's visual…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
