End-to-end Audiovisual Speech Recognition

Stavros Petridis; Themos Stafylakis; Pingchuan Ma; Feipeng Cai; Georgios Tzimiropoulos; Maja Pantic

arXiv:1802.06424·cs.CV·February 23, 2018

TL;DR

This paper introduces an end-to-end audiovisual speech recognition model that learns directly from raw mouth-region images and audio waveforms. It outperforms audio-only models, especially in noisy conditions.

Contribution

The authors present what is, to the best of their knowledge, the first audiovisual fusion model that simultaneously learns feature extraction from raw pixels and waveforms and within-context word recognition on LRW, using residual networks and BGRUs.

Findings

01 Outperforms audio-only models in noisy conditions

02 Achieves slight improvement in clean audio recognition

03 Significantly better performance under high noise levels

Abstract

Several end-to-end deep learning approaches have been recently presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly available dataset (LRW). The model consists of two streams, one for each modality, which extract features directly from mouth regions and raw waveforms. The temporal dynamics in each stream/modality are modeled by a 2-layer BGRU and the fusion of multiple…
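
The two-stream layout described in the abstract can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation. The small convolutional frontends are stand-ins for the residual networks. The abstract is truncated before it describes the fusion step, so concatenating the two stream outputs and passing them through a further BGRU is an assumption. The class names (`Stream`, `VisualFrontend`, `AudioFrontend`, `AudioVisualModel`), the temporal averaging before the classifier, and all sizes (29 frames of 96x96 pixels, 18560 audio samples, 500 classes) are likewise hypothetical choices made for the example.

```python
import torch
import torch.nn as nn


class VisualFrontend(nn.Module):
    """Stand-in for the visual residual network (assumption: one small conv layer)."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, feat_dim, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # one feature vector per frame
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 1, height, width) grayscale mouth crops
        b, t = frames.shape[:2]
        x = self.net(frames.flatten(0, 1))        # (batch * time, feat_dim, 1, 1)
        return x.flatten(1).reshape(b, t, -1)     # (batch, time, feat_dim)


class AudioFrontend(nn.Module):
    """Stand-in for the audio residual network operating on raw waveforms."""

    def __init__(self, feat_dim: int, steps: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, feat_dim, kernel_size=80, stride=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(steps),  # align audio steps with video frames
        )

    def forward(self, wave: torch.Tensor) -> torch.Tensor:
        # wave: (batch, 1, samples)
        return self.net(wave).transpose(1, 2)     # (batch, steps, feat_dim)


class Stream(nn.Module):
    """One modality: feature extractor followed by a 2-layer BGRU."""

    def __init__(self, frontend: nn.Module, feat_dim: int, hidden: int):
        super().__init__()
        self.frontend = frontend
        self.bgru = nn.GRU(feat_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.bgru(self.frontend(x))      # (batch, time, 2 * hidden)
        return out


class AudioVisualModel(nn.Module):
    """Two streams plus an assumed fusion BGRU and a linear word classifier."""

    def __init__(self, steps: int = 29, feat_dim: int = 32,
                 hidden: int = 64, num_classes: int = 500):
        super().__init__()
        self.video = Stream(VisualFrontend(feat_dim), feat_dim, hidden)
        self.audio = Stream(AudioFrontend(feat_dim, steps), feat_dim, hidden)
        # Assumption: fuse by concatenation, then model the joint sequence.
        self.fusion = nn.GRU(4 * hidden, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames: torch.Tensor, wave: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([self.video(frames), self.audio(wave)], dim=-1)
        fused, _ = self.fusion(joint)             # (batch, time, 2 * hidden)
        # Assumption: average over time before classifying the word.
        return self.classifier(fused.mean(dim=1))


model = AudioVisualModel()
frames = torch.randn(2, 29, 1, 96, 96)   # two clips of 29 mouth-region frames
wave = torch.randn(2, 1, 18560)          # matching raw audio
logits = model(frames, wave)
print(logits.shape)  # torch.Size([2, 500])
```

Both streams are trained jointly from raw inputs in this sketch, so gradients from the word classifier reach the pixel-level and waveform-level layers, which is the sense in which the abstract calls the model end-to-end.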
