# Multimodal Speech Recognition for Language-Guided Embodied Agents

**Authors:** Allen Chang, Xiaoyuan Zhu, Aarav Monga, Seoho Ahn, Tejas Srinivasan, Jesse Thomason

arXiv: 2302.14030 · 2023-10-11

## TL;DR

This paper introduces a multimodal automatic speech recognition model that leverages visual context to improve transcription accuracy of spoken instructions for embodied agents, leading to better task completion.

## Contribution

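The paper proposes training a multimodal ASR model that uses the accompanying visual context to reduce errors when transcribing spoken instructions for embodied agents. The model is trained on spoken instructions synthesized from the ALFRED task completion dataset, with acoustic noise simulated by systematically masking spoken words.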

## Key findings

- Multimodal ASR recovers up to 30% more masked words than unimodal models.
- A text-trained embodied agent completes tasks more often when it follows transcripts from multimodal ASR models rather than unimodal ones.

## Abstract

Benchmarks for language-guided embodied agents typically assume text-based instructions, but deployed agents will encounter spoken instructions. While Automatic Speech Recognition (ASR) models can bridge the input gap, erroneous ASR transcripts can hurt the agents' ability to complete tasks. In this work, we propose training a multimodal ASR model to reduce errors in transcribing spoken instructions by considering the accompanying visual context. We train our model on a dataset of spoken instructions, synthesized from the ALFRED task completion dataset, where we simulate acoustic noise by systematically masking spoken words. We find that utilizing visual observations facilitates masked word recovery, with multimodal ASR models recovering up to 30% more masked words than unimodal baselines. We also find that a text-trained embodied agent successfully completes tasks more often by following transcribed instructions from multimodal ASR models.

Code: github.com/Cylumn/embodied-multimodal-asr
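The masking setup and the "masked word recovery" measure described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the paper masks words in synthesized speech, whereas the sketch below masks text tokens as a stand-in. The function names (`mask_words`, `recovery_rate`), the `<mask>` token, the mask fraction, and the position-aligned comparison are all assumptions made for illustration.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a word lost to noise


def mask_words(words, mask_fraction, rng):
    """Replace a random subset of words with MASK.

    Returns the masked word list and the sorted indices that were masked.
    Token-level masking is a simplification; the paper masks spoken words.
    """
    n_mask = round(len(words) * mask_fraction)
    masked_idx = sorted(rng.sample(range(len(words)), n_mask))
    chosen = set(masked_idx)
    masked = [MASK if i in chosen else w for i, w in enumerate(words)]
    return masked, masked_idx


def recovery_rate(reference, hypothesis, masked_idx):
    """Fraction of masked positions where the transcript matches the reference.

    Assumes the hypothesis is position-aligned with the reference; a real
    ASR transcript would generally need an alignment step first.
    """
    if not masked_idx:
        return 0.0
    hits = sum(
        1
        for i in masked_idx
        if i < len(hypothesis) and hypothesis[i] == reference[i]
    )
    return hits / len(masked_idx)


rng = random.Random(0)  # fixed seed so the example is repeatable
reference = "pick up the mug from the table".split()
masked, masked_idx = mask_words(reference, 0.4, rng)

# A transcript that restores every masked word scores 1.0;
# one that leaves the masks in place scores 0.0.
perfect = recovery_rate(reference, reference, masked_idx)
unrecovered = recovery_rate(reference, masked, masked_idx)
print(masked, perfect, unrecovered)
```

Under this framing, comparing a unimodal and a multimodal model amounts to computing `recovery_rate` for each model's transcript over the same masked positions.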

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.14030/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/2302.14030/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/2302.14030/full.md

---
Source: https://tomesphere.com/paper/2302.14030