Can Visual Context Improve Automatic Speech Recognition for an Embodied Agent?

Pradip Pramanick; Chayan Sarkar

arXiv:2210.13189·eess.AS·October 25, 2022

Open Access

TL;DR

This paper introduces a method that feeds a robot's visual context into automatic speech recognition (ASR), achieving a 59% relative reduction in word error rate (WER) and improving recognition of entity descriptions in noisy or accented speech.

Contribution

The paper proposes a new decoder biasing technique that incorporates visual information into ASR, improving accuracy without degrading the output when the visual context is incorrect.
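To make the idea of biasing toward visual context concrete, here is a minimal sketch of generic contextual biasing applied as n-best rescoring. It is not the paper's decoder-level technique: the function names (`bias_score`, `rescore`), the fixed per-word `bonus`, and the example hypotheses and scores are all hypothetical, chosen only for illustration.

```python
from typing import Iterable, List, Tuple


def bias_score(hypothesis: str, base_score: float,
               context_words: Iterable[str], bonus: float = 1.0) -> float:
    """Add a fixed bonus for every hypothesis word found in the context words.

    Assumption: base_score is a log-probability-like score (higher is better).
    """
    vocab = {w.lower() for w in context_words}
    matches = sum(1 for w in hypothesis.lower().split() if w in vocab)
    return base_score + bonus * matches


def rescore(nbest: List[Tuple[str, float]],
            context_words: Iterable[str],
            bonus: float = 1.0) -> List[Tuple[str, float]]:
    """Return (hypothesis, biased_score) pairs sorted best-first."""
    vocab = list(context_words)
    scored = [(h, bias_score(h, s, vocab, bonus)) for h, s in nbest]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical n-best list: the acoustically preferred hypothesis is wrong.
nbest = [("bring the pink pillow", -4.0), ("bring the pink pillar", -3.8)]

# Hypothetical labels of objects the robot currently sees.
visible_objects = ["pillow", "sofa"]

best_hypothesis, best_score = rescore(nbest, visible_objects)[0]
```

In this toy setup, words matching the visible objects lift the correct hypothesis to the top, while a context list with no matching words leaves the original ranking unchanged.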

Findings

01. Achieved a 59% relative reduction in WER
02. Improved recognition of entity descriptions in noisy conditions
03. Enhanced ASR performance with visual context integration
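For readers unfamiliar with the metric, the following sketch shows how WER and a relative WER reduction are conventionally computed. The function names and all example sentences and numbers are hypothetical; the baseline and improved WER values below are illustrative only and are not taken from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline WER removed by the improved system."""
    return (baseline_wer - improved_wer) / baseline_wer


# One substituted word out of four reference words.
example_wer = wer("take the red cup", "take the bread cup")

# Illustrative values only: a drop from 20.0% to 8.2% WER.
example_reduction = relative_reduction(0.20, 0.082)
```

With these made-up values, `example_wer` is 0.25 and `example_reduction` is about 0.59, which is what a "59% relative reduction" means: the improved system makes 59% fewer word errors than the baseline, not 59 percentage points fewer.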

Abstract

The use of automatic speech recognition (ASR) systems is becoming omnipresent, ranging from personal assistants to chatbots and home and industrial automation systems. Modern robots are also equipped with ASR capabilities for interacting with humans, as speech is the most natural interaction modality. However, ASR in robots faces additional challenges compared to a personal assistant. Being an embodied agent, a robot must recognize the physical entities around it and therefore reliably recognize speech containing descriptions of such entities. However, current ASR systems are often unable to do so due to limitations in ASR training, such as generic datasets and open-vocabulary modeling. Also, adverse conditions during inference, such as noise, accented speech, and far-field speech, make the transcription inaccurate. In this work, we present a method to incorporate a robot's visual…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems