Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions

Tejas Srinivasan; Ramon Sanabria; Florian Metze

arXiv:1907.00477 · cs.CL · January 1, 2020 · 6 cites

TL;DR

This paper investigates how visual context influences multimodal speech recognition performance under noisy conditions, revealing that current models do not effectively utilize visual information when audio signals are degraded.

Contribution

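The study evaluates the utility of visual context in multimodal speech recognition in adversarial settings, where part of the audio signal is withheld from the model at inference time, and highlights that current methods of integrating the visual modality do not improve robustness.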

Findings

01

Multimodal models outperform unimodal models by up to 4.2% WER in clean conditions.

02

Visual information is not utilized when audio is corrupted by noise.

03

Current multimodal integration techniques do not enhance robustness to noisy audio.

Abstract

Multimodal learning allows us to leverage information from multiple sources (visual, acoustic and text), similar to our experience of the real world. However, it is currently unclear to what extent auxiliary modalities improve performance over unimodal models, and under what circumstances the auxiliary modalities are useful. We examine the utility of the auxiliary visual context in Multimodal Automatic Speech Recognition (MMASR) in adversarial settings, where we deprive the models of part of the audio signal at inference time. Our experiments show that while MMASR models show significant gains over traditional speech-to-text architectures (up to 4.2% WER improvements), they do not incorporate visual information when the audio signal has been corrupted. This shows that current methods of integrating the visual modality do not improve model robustness to noise, and we need better visually…
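The gains quoted above are reported in word error rate (WER). As a rough illustration of how that metric is conventionally computed, here is a minimal sketch: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the authors' evaluation code, and real toolkits usually normalize text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the")
    # against a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions, so it is not strictly a percentage of words wrong.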

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis