Fine-Grained Grounding for Multimodal Speech Recognition

Tejas Srinivasan; Ramon Sanabria; Florian Metze; Desmond Elliott

arXiv:2010.02384·cs.CL·October 7, 2020

TL;DR

This paper introduces fine-grained visual grounding for multimodal speech recognition: the model draws on local image regions rather than a single global image feature, so it can recover word types beyond entities, such as adjectives and verbs.

Contribution

It presents a model that uses automatic object proposals to supply localized visual context, improving the recovery of a broader range of word types in multimodal speech recognition.

Findings

01 Improves recognition accuracy over global visual features

02 Enables recovery of adjectives and verbs

03 Localization of proposals is key to performance

Abstract

Multimodal automatic speech recognition systems integrate information from images to improve speech recognition quality, by grounding the speech in the visual context. While visual signals have been shown to be useful for recovering entities that have been masked in the audio, these models should be capable of recovering a broader range of word types. Existing systems rely on global visual features that represent the entire image, but localizing the relevant regions of the image will make it possible to recover a larger set of words, such as adjectives and verbs. In this paper, we propose a model that uses finer-grained visual information from different parts of the image, using automatic object proposals. In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features, that the proposals enable the model to recover…
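To make the contrast between global and region-level visual features concrete, here is a minimal sketch in plain Python. The abstract does not say how the proposal features are combined, so the dot-product attention below is an assumption chosen for illustration, not the authors' method. The names `attend_over_regions` and `mean_pooled_feature` are hypothetical, and the mean-pooled vector is only a stand-in for a global image feature.

```python
import math


def attend_over_regions(query, region_feats):
    """Scaled dot-product attention over per-region feature vectors.

    query        : list of floats (e.g. a decoder state), length d
    region_feats : list of region vectors, each of length d
    Returns (weights, context), where weights sum to 1 and context is
    the weighted sum of the region vectors.
    NOTE: illustrative assumption, not the mechanism from the paper.
    """
    d = len(query)
    scores = [
        sum(q * r for q, r in zip(query, feat)) / math.sqrt(d)
        for feat in region_feats
    ]
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [
        sum(w * feat[i] for w, feat in zip(weights, region_feats))
        for i in range(d)
    ]
    return weights, context


def mean_pooled_feature(region_feats):
    """Stand-in for a global feature: every region weighted equally."""
    n = len(region_feats)
    d = len(region_feats[0])
    return [sum(feat[i] for feat in region_feats) / n for i in range(d)]


# Toy example: three hypothetical region vectors and a query aligned with the first.
regions = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
weights, context = attend_over_regions([1.0, 0.0], regions)
pooled = mean_pooled_feature(regions)
```

In this toy setup the attention weights concentrate on the region that matches the query, whereas the pooled vector treats all regions alike. That is the intuition behind preferring localized features when the word to recover depends on one part of the image.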

Code & Models

Repositories

tejas1995/MultimodalASR (official)