End-to-End Audiovisual Fusion with LSTMs
Stavros Petridis, Yujiang Wang, Zuwei Li, Maja Pantic

TL;DR
This paper introduces an end-to-end audiovisual fusion model based on BLSTMs that learns directly from raw pixels and spectrograms to classify speech and nonlinguistic vocalisations. It improves on the previous state of the art on the AVIC database and outperforms audio-only models in noisy conditions.
Contribution
To the authors' knowledge, this is the first audiovisual fusion model that jointly learns feature extraction and classification directly from raw pixels and spectrograms, with a BLSTM modeling the temporal dynamics of each modality stream.
Findings
Achieved a 1.9% improvement over audio-only models in nonlinguistic vocalisation classification.
Improved state-of-the-art performance on the AVIC database with a 9.7% increase in mean F1.
Significantly outperformed audio-only models in noisy environments across multiple views.
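For readers unfamiliar with the metric, the sketch below shows one common way to compute a mean F1 score, assuming it denotes the unweighted average of per-class F1 values (the summary does not spell out the averaging scheme). The function name `mean_f1` and the toy labels are made up for illustration and are not taken from the paper or from AVIC.

```python
def mean_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed meaning of 'mean F1')."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, invented purely to exercise the function.
y_true = ["a", "a", "b", "c", "c", "d"]
y_pred = ["a", "d", "b", "c", "a", "d"]
score = mean_f1(y_true, y_pred)
print(round(score, 4))
```

Because every class contributes equally to the average, a rare class that is classified poorly lowers the score as much as a frequent one would.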
Abstract
Several end-to-end deep learning approaches have been recently presented which simultaneously extract visual features from the input images and perform visual speech classification. However, research on jointly extracting audio and visual features and performing classification is very limited. In this work, we present an end-to-end audiovisual model based on Bidirectional Long Short-Term Memory (BLSTM) networks. To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the pixels and spectrograms and perform classification of speech and nonlinguistic vocalisations. The model consists of multiple identical streams, one for each modality, which extract features directly from mouth regions and spectrograms. The temporal dynamics in each stream/modality are modeled by a BLSTM and the fusion of multiple…
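The multi-stream layout described in the abstract can be sketched in a few lines of dependency-free Python. This is only an illustrative forward pass with random, untrained weights, not the authors' implementation. Several details are assumptions: the encoder is reduced to a single linear layer, biases are omitted, both modalities are assumed to have the same number of time steps, and fusion is done by concatenating the per-step stream outputs and feeding them to another BLSTM (the excerpt is truncated before it describes the fusion step). All class and parameter names are hypothetical.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (rows x cols) as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTM:
    """Single-direction LSTM over a list of input vectors (biases omitted)."""

    def __init__(self, n_in, n_hid, rng):
        self.n_hid = n_hid
        # One weight matrix per gate: input, forget, output, candidate.
        self.w = [make_matrix(n_hid, n_in + n_hid, rng) for _ in range(4)]

    def run(self, seq):
        h = [0.0] * self.n_hid
        c = [0.0] * self.n_hid
        outputs = []
        for x in seq:
            z = x + h  # concatenate input with previous hidden state
            i, f, o, g = (matvec(w, z) for w in self.w)
            c = [sigmoid(fj) * cj + sigmoid(ij) * math.tanh(gj)
                 for ij, fj, gj, cj in zip(i, f, g, c)]
            h = [sigmoid(oj) * math.tanh(cj) for oj, cj in zip(o, c)]
            outputs.append(h)
        return outputs


class BLSTM:
    """Runs one LSTM forward and one backward, concatenating per-step states."""

    def __init__(self, n_in, n_hid, rng):
        self.fwd = LSTM(n_in, n_hid, rng)
        self.bwd = LSTM(n_in, n_hid, rng)

    def run(self, seq):
        f = self.fwd.run(seq)
        b = self.bwd.run(seq[::-1])[::-1]  # re-align backward pass in time
        return [hf + hb for hf, hb in zip(f, b)]


class Stream:
    """One modality: stand-in linear encoder on raw input, then a BLSTM."""

    def __init__(self, n_in, n_feat, n_hid, rng):
        self.enc = make_matrix(n_feat, n_in, rng)
        self.blstm = BLSTM(n_feat, n_hid, rng)

    def run(self, seq):
        feats = [[math.tanh(v) for v in matvec(self.enc, x)] for x in seq]
        return self.blstm.run(feats)


class FusionModel:
    def __init__(self, n_video, n_audio, n_classes, n_feat=8, n_hid=4, seed=0):
        rng = random.Random(seed)
        self.video = Stream(n_video, n_feat, n_hid, rng)
        self.audio = Stream(n_audio, n_feat, n_hid, rng)
        # ASSUMPTION: fusion = concatenate stream outputs, then another BLSTM.
        self.fusion = BLSTM(4 * n_hid, n_hid, rng)
        self.out = make_matrix(n_classes, 2 * n_hid, rng)

    def predict_proba(self, video_seq, audio_seq):
        v = self.video.run(video_seq)
        a = self.audio.run(audio_seq)
        fused = self.fusion.run([hv + ha for hv, ha in zip(v, a)])
        logits = matvec(self.out, fused[-1])  # classify from the last step
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        return [e / sum(exps) for e in exps]


# Toy input: 5 time steps of flattened "pixels" (16) and "spectrogram" bins (10).
rng = random.Random(1)
video = [[rng.random() for _ in range(16)] for _ in range(5)]
audio = [[rng.random() for _ in range(10)] for _ in range(5)]
model = FusionModel(n_video=16, n_audio=10, n_classes=3)
probs = model.predict_proba(video, audio)
print(probs)
```

The point of the sketch is the data flow: each stream sees only its own modality, and the streams meet only at the fusion stage, so in a trained version the feature extractors and the classifier could be optimised jointly from raw input.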
