Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations

Dan Oneata; Horia Cucu

arXiv:2204.13206 · cs.SD · April 29, 2022

TL;DR

This paper enhances multimodal speech recognition by leveraging pretrained speech models and data augmentation techniques to better integrate visual information, resulting in improved performance across multiple datasets.

Contribution

The paper introduces pretrained ASR models and novel speech data augmentation methods into multimodal speech recognition to improve its performance.

Findings

01. Pretrained ASR models significantly boost performance.

02. Speech data augmentation improves multimodal attention to visual stimuli.

03. Consistent gains are observed across three datasets, including Localized Narratives.

Abstract

Multimodal speech recognition aims to improve the performance of automatic speech recognition (ASR) systems by leveraging additional visual information that is usually associated with the audio input. While previous approaches make crucial use of strong visual representations, e.g. by finetuning pretrained image recognition networks, significantly less attention has been paid to their counterpart: the speech component. In this work, we investigate ways of improving the base speech recognition system by following techniques similar to the ones used for the visual encoder, namely, transferring representations and data augmentation. First, we show that starting from a pretrained ASR model significantly improves the state-of-the-art performance; remarkably, even when building upon a strong unimodal system, we still find gains by including the visual modality. Second, we employ speech data…
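The abstract is cut off before it names the speech augmentation that is used, so the sketch below does not reproduce the authors' method. It only illustrates one widely used family of speech augmentations, masking a contiguous span of acoustic frames, which shows how hiding part of the audio could push a multimodal model to rely on other inputs. The function name `mask_time_span`, its parameters, and the toy feature values are all assumptions made for illustration.

```python
import random
from typing import List, Optional

Frame = List[float]  # one acoustic feature vector (e.g. one spectrogram column)


def mask_time_span(
    frames: List[Frame],
    max_width: int,
    rng: Optional[random.Random] = None,
    fill: float = 0.0,
) -> List[Frame]:
    """Return a copy of `frames` with one random contiguous span set to `fill`.

    Hypothetical helper: a generic time-masking augmentation, not the
    procedure described in the paper.
    """
    rng = rng or random.Random()
    masked = [list(f) for f in frames]  # copy so the input is left untouched
    if not masked or max_width <= 0:
        return masked
    # Pick the span width, then a start index that keeps the span in range.
    width = rng.randint(1, min(max_width, len(masked)))
    start = rng.randint(0, len(masked) - width)
    for t in range(start, start + width):
        masked[t] = [fill] * len(masked[t])
    return masked


# Toy usage: 10 frames with 2 feature dimensions each (made-up values).
features = [[float(t) + 1.0, float(t) + 1.5] for t in range(10)]
augmented = mask_time_span(features, max_width=3, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the augmentation reproducible; in training one would typically draw a fresh mask for every utterance and epoch.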

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing