End-to-End Audiovisual Fusion with LSTMs

Stavros Petridis; Yujiang Wang; Zuwei Li; Maja Pantic

arXiv:1709.04343 · cs.CV · September 14, 2017

TL;DR

This paper introduces a novel end-to-end audiovisual fusion model using BLSTMs that learns directly from raw pixels and spectrograms for speech and vocalization classification, outperforming previous methods especially in noisy conditions.

Contribution

It presents what the authors describe as, to the best of their knowledge, the first audiovisual fusion model that jointly learns feature extraction and classification directly from raw pixels and spectrograms using BLSTMs, advancing end-to-end multimodal speech recognition.

Findings

01. Achieved 1.9% improvement in nonlinguistic vocalization classification over audio-only models.

02. Improved state-of-the-art performance on the AVIC database with a 9.7% increase in mean F1.

03. Significantly outperformed audio-only models in noisy environments across multiple views.

Abstract

Several end-to-end deep learning approaches have been recently presented which simultaneously extract visual features from the input images and perform visual speech classification. However, research on jointly extracting audio and visual features and performing classification is very limited. In this work, we present an end-to-end audiovisual model based on Bidirectional Long Short-Term Memory (BLSTM) networks. To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the pixels and spectrograms and perform classification of speech and nonlinguistic vocalisations. The model consists of multiple identical streams, one for each modality, which extract features directly from mouth regions and spectrograms. The temporal dynamics in each stream/modality are modeled by a BLSTM and the fusion of multiple…
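The stream layout described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal, untrained, pure-Python toy that assumes each modality has already been reduced to one feature vector per time step and that both sequences are aligned to the same length. The per-stream feature extractor from raw pixels and spectrograms is omitted, and because the excerpt is cut off before the fusion step is described, the per-frame concatenation in the hypothetical `fuse` helper is only a placeholder assumption. All class names, function names, and sizes are invented for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """One LSTM cell with small random weights (untrained, illustrative only)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # Four gate blocks: input, forget, output, candidate.
        self.W = [[[rng.uniform(-0.1, 0.1) for _ in range(n)]
                   for _ in range(hidden_size)] for _ in range(4)]
        self.b = [[0.0] * hidden_size for _ in range(4)]

    def step(self, x, h, c):
        z = x + h  # list concatenation of the input and previous hidden state
        pre = [[sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(self.W[k], self.b[k])] for k in range(4)]
        i = [sigmoid(v) for v in pre[0]]
        f = [sigmoid(v) for v in pre[1]]
        o = [sigmoid(v) for v in pre[2]]
        g = [math.tanh(v) for v in pre[3]]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def run_lstm(cell, seq):
    """Run a cell over a sequence and return the hidden state at every step."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    outputs = []
    for x in seq:
        h, c = cell.step(x, h, c)
        outputs.append(h)
    return outputs


class BLSTM:
    """Bidirectional LSTM: one pass forward in time, one pass backward."""

    def __init__(self, input_size, hidden_size, rng):
        self.fwd = LSTMCell(input_size, hidden_size, rng)
        self.bwd = LSTMCell(input_size, hidden_size, rng)

    def __call__(self, seq):
        forward = run_lstm(self.fwd, seq)
        backward = run_lstm(self.bwd, seq[::-1])[::-1]  # re-align to time order
        return [hf + hb for hf, hb in zip(forward, backward)]


def fuse(video_seq, audio_seq, video_blstm, audio_blstm):
    """ASSUMPTION: fuse by concatenating the two streams' outputs per time step."""
    v = video_blstm(video_seq)
    a = audio_blstm(audio_seq)
    return [hv + ha for hv, ha in zip(v, a)]


rng = random.Random(0)
T = 5  # hypothetical number of aligned time steps
video = [[rng.random() for _ in range(6)] for _ in range(T)]  # toy visual features
audio = [[rng.random() for _ in range(4)] for _ in range(T)]  # toy audio features
video_blstm = BLSTM(6, 3, rng)
audio_blstm = BLSTM(4, 3, rng)
fused = fuse(video, audio, video_blstm, audio_blstm)
print(len(fused), len(fused[0]))  # prints: 5 12
```

Each stream contributes a forward and a backward hidden state of size 3 per time step, so the fused vector has 12 values per step. In a trained system a classifier would sit on top of the fused representation; that part is left out here because the excerpt does not describe it.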
