Audio-Visual Speech Recognition With A Hybrid CTC/Attention Architecture

Stavros Petridis; Themos Stafylakis; Pingchuan Ma; Georgios Tzimiropoulos; Maja Pantic

arXiv:1810.00108·cs.CV·October 2, 2018

TL;DR

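This paper applies the recently proposed hybrid CTC/attention architecture to audio-visual speech recognition. It reports state-of-the-art results on LRS2 and improved robustness to noise by pairing a CTC loss, which forces monotonic alignments, with an attention-based model that does not rely on CTC's conditional independence assumption.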

Contribution

First application of hybrid CTC/attention architecture to audio-visual speech recognition, improving accuracy and noise robustness on the LRS2 dataset.

Findings

01

Achieved a 7% word error rate on LRS2, setting a new state of the art.

02

Reduced word error rate by 1.3% absolute compared to the audio-only model.

03

Significantly outperformed the audio-only model under noisy conditions, with up to a 32.9% improvement in word error rate.
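
The findings above are stated as word error rates (WER). As a minimal sketch of how that metric is usually computed (word-level edit distance divided by the reference length), and of the difference between an absolute and a relative change, consider the following. The function names and the example numbers in the usage note are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference word missing from hypothesis
                curr[j - 1] + 1,         # extra word in hypothesis
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def absolute_change(baseline_wer: float, new_wer: float) -> float:
    """Absolute WER decrease, in the same units as the inputs."""
    return baseline_wer - new_wer


def relative_change(baseline_wer: float, new_wer: float) -> float:
    """WER decrease as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer
```

For instance, with hypothetical values, going from a 10% to an 8% WER is a 2% absolute decrease but a 20% relative decrease, which is why it matters that the paper's 1.3% figure is stated as absolute.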

Abstract

Recent works in speech recognition rely either on connectionist temporal classification (CTC) or sequence-to-sequence models for character-level recognition. CTC assumes conditional independence of individual characters, whereas attention-based models can provide nonsequential alignments. Therefore, we could use a CTC loss in combination with an attention-based model in order to force monotonic alignments and at the same time get rid of the conditional independence assumption. In this paper, we use the recently proposed hybrid CTC/attention architecture for audio-visual recognition of speech in-the-wild. To the best of our knowledge, this is the first time that such a hybrid architecture is used for audio-visual recognition of speech. We use the LRS2 database and show that the proposed audio-visual model leads to a 1.3% absolute decrease in word error rate over the…
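
To make the idea of combining a CTC loss with an attention-based loss concrete, here is a minimal pure-Python sketch. It assumes the interpolation commonly used in the hybrid CTC/attention literature, loss = w * L_ctc + (1 - w) * L_att. The function names, the toy inputs, and the default weight of 0.5 are illustrative assumptions; the abstract does not give the paper's actual weight or implementation.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_nll(log_probs, target, blank=0):
    """-log p(target | input) under CTC, via the forward algorithm.

    log_probs[t][k] is the log-probability of symbol k at input frame t.
    CTC sums over all monotonic frame-level paths that collapse to target.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for s in target:
        ext += [s, blank]
    size = len(ext)
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            terms = [alpha[s]]                    # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])        # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])        # skip the blank in between
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new
    # Valid paths end on the last label or on the trailing blank.
    final = logsumexp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]
    return -final


def attention_nll(step_log_probs, target):
    """-log p(target) from a decoder: step_log_probs[i][k] is the log-probability
    of symbol k at output step i, conditioned on the previous outputs."""
    return -sum(step_log_probs[i][s] for i, s in enumerate(target))


def hybrid_loss(ctc_log_probs, att_log_probs, target, ctc_weight=0.5):
    """Interpolate the two losses. ctc_weight=0.5 is an arbitrary illustrative
    value, not taken from the paper."""
    l_ctc = ctc_nll(ctc_log_probs, target)
    l_att = attention_nll(att_log_probs, target)
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_att
```

In this sketch the CTC term only credits monotonic alignments, while the attention term scores each character conditioned on the preceding ones, so the weighted sum reflects both properties described in the abstract.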

Taxonomy

Methods: Connectionist Temporal Classification Loss