Audio-Visual Speech Recognition With A Hybrid CTC/Attention Architecture
Stavros Petridis, Themos Stafylakis, Pingchuan Ma, Georgios Tzimiropoulos, Maja Pantic

TL;DR
This paper applies the recently proposed hybrid CTC/attention architecture to audio-visual speech recognition, reporting state-of-the-art results on LRS2 and improved robustness to noise by pairing the monotonic alignments enforced by a CTC loss with an attention-based sequence-to-sequence model.
Contribution
First application, to the best of the authors' knowledge, of a hybrid CTC/attention architecture to audio-visual speech recognition, improving accuracy and noise robustness on the LRS2 dataset.
Findings
Achieved a 7% word error rate on LRS2, a new state of the art.
Reduced word error rate by 1.3% absolute compared with the audio-only model.
Outperformed the audio-only model under noisy conditions, with improvements in word error rate of up to 32.9%.
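The findings above are stated in terms of word error rate (WER) and an absolute reduction in it. As a minimal sketch of how those quantities are conventionally computed, the Python below implements WER as word-level edit distance divided by reference length, plus helpers for absolute and relative reductions. The function names and the example sentences are hypothetical, and this is not the authors' evaluation code. The 8.3% baseline in the usage lines is an assumption, implied only if the 7% and 1.3% figures describe the same comparison.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def absolute_reduction(baseline_wer, new_wer):
    """Difference in WER, in the same units as the inputs."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer, new_wer):
    """Difference in WER as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical sentences: one substitution and one deletion over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))

# Assumed baseline of 8.3% WER against the reported 7% WER.
print(absolute_reduction(0.083, 0.070))
print(relative_reduction(0.083, 0.070))
```

Separating the two reduction helpers matters when reading the findings: a 1.3% absolute reduction is a difference in percentage points, which is a considerably larger change when expressed relative to a single-digit baseline.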
Abstract
Recent works in speech recognition rely either on connectionist temporal classification (CTC) or sequence-to-sequence models for character-level recognition. CTC assumes conditional independence of individual characters, whereas attention-based models can provide nonsequential alignments. Therefore, we could use a CTC loss in combination with an attention-based model in order to force monotonic alignments and at the same time get rid of the conditional independence assumption. In this paper, we use the recently proposed hybrid CTC/attention architecture for audio-visual recognition of speech in-the-wild. To the best of our knowledge, this is the first time that such a hybrid architecture is used for audio-visual recognition of speech. We use the LRS2 database and show that the proposed audio-visual model leads to a 1.3% absolute decrease in word error rate over the…
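The combination of a CTC loss with an attention-based model described in the abstract is commonly written as a weighted interpolation of the two negative log-likelihoods. The sketch below illustrates that idea in plain Python, with a log-domain CTC forward algorithm and a per-step decoder loss. The function names, the blank index, and the `ctc_weight` default of 0.2 are assumptions chosen for illustration: this summary does not state the weight or the exact objective the authors used.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) for a few values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC loss via the forward algorithm.

    log_probs: T lists of per-frame log-probabilities over the vocabulary.
    target: list of label ids without blanks.
    """
    # Interleave blanks: [blank, y1, blank, y2, ..., blank].
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for frame in log_probs[1:]:
        new = [NEG_INF] * size
        for s in range(size):
            terms = [alpha[s]]                 # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])     # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])     # skip the blank between labels
            new[s] = logsumexp(*terms) + frame[ext[s]]
        alpha = new

    # Valid paths end on the last label or the trailing blank.
    total = logsumexp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]
    return -total


def attention_neg_log_likelihood(step_log_probs, target):
    """Sum of -log p(y_u | y_<u, x) over decoder steps."""
    return -sum(step[y] for step, y in zip(step_log_probs, target))


def hybrid_loss(ctc_nll, att_nll, ctc_weight=0.2):
    """Interpolate the two losses; ctc_weight is an assumed value."""
    return ctc_weight * ctc_nll + (1.0 - ctc_weight) * att_nll


# Toy example: two frames, vocabulary {0: blank, 1: 'a'}, uniform outputs.
frames = [[math.log(0.5), math.log(0.5)]] * 2
decoder_steps = [[math.log(0.1), math.log(0.9)]]
ctc = ctc_neg_log_likelihood(frames, [1])
att = attention_neg_log_likelihood(decoder_steps, [1])
print(hybrid_loss(ctc, att))
```

In the toy example, three frame-level paths collapse to the single label ('a a', 'a blank', 'blank a'). The forward recursion only lets a path stay or move forward through the label sequence, which is the monotonic constraint the abstract refers to. The attention term scores each output conditioned on the previous ones, so it carries no conditional independence assumption.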
Taxonomy
Methods: Connectionist Temporal Classification Loss
