Multimodal Speech Recognition with Unstructured Audio Masking
Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott

TL;DR
This paper introduces RandWordMask, a more realistic, unstructured masking scheme for training multimodal speech recognition models, and shows that visual context helps recover masked words and improves robustness to noisy audio.
Contribution
It proposes RandWordMask, a novel unstructured masking method, and shows that multimodal ASR can effectively leverage visual cues in realistic noisy conditions.
Findings
Multimodal ASR can recover masked words in unstructured masking scenarios.
Models attend to visual signals when audio is corrupted.
Visual context improves robustness in noisy speech recognition.
Abstract
Visual context has been shown to be useful for automatic speech recognition (ASR) systems when the speech signal is noisy or corrupted. Previous work, however, has only demonstrated the utility of visual context in an unrealistic setting, where a fixed set of words is systematically masked in the audio. In this paper, we simulate a more realistic masking scenario during model training, called RandWordMask, where the masking can occur for any word segment. Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words in this unstructured masking setting. Moreover, our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted. These results show that multimodal ASR systems can leverage the visual signal in more generalized noisy scenarios.
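The idea of masking arbitrary word segments can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `rand_word_mask`, the masking probability `mask_prob`, the fill value `mask_value`, and the assumption that word-level frame alignments are available as `(start, end)` pairs are all hypothetical choices made for illustration. The abstract only states that masking can occur for any word segment.

```python
import random


def rand_word_mask(frames, word_spans, mask_prob=0.2, mask_value=0.0, rng=None):
    """Overwrite the frames of randomly chosen word segments.

    frames:     list of per-frame feature vectors (lists of floats)
    word_spans: list of (start, end) frame indices per word, end exclusive
                (assumed to come from some word-level alignment)
    mask_prob:  chance that any given word is masked (assumed value)
    mask_value: value written into masked frames (assumed; e.g. silence)
    """
    rng = rng or random.Random()
    # Copy so the caller's features are left untouched.
    masked = [list(frame) for frame in frames]
    masked_words = []
    for idx, (start, end) in enumerate(word_spans):
        # Every word is a candidate, not just a fixed vocabulary subset.
        if rng.random() < mask_prob:
            for t in range(start, min(end, len(masked))):
                masked[t] = [mask_value] * len(masked[t])
            masked_words.append(idx)
    return masked, masked_words


# Example: 10 frames of 2-dimensional features covering three words.
frames = [[1.0, 1.0] for _ in range(10)]
spans = [(0, 3), (3, 7), (7, 10)]
masked, which = rand_word_mask(frames, spans, mask_prob=0.5, rng=random.Random(0))
```

Because each word is sampled independently, the set of masked words changes from one training example to the next, which is the contrast with masking a fixed set of words that the abstract draws.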
