Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition

George Sterpu; Christian Saam; Naomi Harte

arXiv:1809.01728·eess.AS·May 2, 2019

TL;DR

This paper introduces an attention-based audio-visual fusion method for automatic speech recognition. The method learns to align lip motion with the audio signal automatically, which improves recognition accuracy in both clean and noisy conditions.

Contribution

An attention-based fusion strategy that goes beyond simple feature concatenation: it learns to align the audio and visual modalities automatically, and the resulting representations improve speech recognition, particularly in noise.

Findings

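01. Relative improvements from 7% up to 30% on TCD-TIMIT over audio-only recognition, depending on the acoustic noise level.

02. Evaluated under three types of noise at different power ratios, with accuracy gains in both clean and noisy conditions.

03. Integrates easily with existing Sequence-to-Sequence architectures.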
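For readers unfamiliar with how a relative improvement is computed, the following is a minimal sketch. It assumes the figure is taken over an error rate where lower is better; this page does not state which metric the paper uses. The function name and the numbers are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Fraction by which new_error improves on baseline_error (lower is better)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical example: an audio-only error rate of 20.0% reduced to 14.0%
# by an audio-visual system corresponds to a 30% relative improvement.
print(round(relative_improvement(20.0, 14.0), 2))  # prints 0.3
```

Note that a relative improvement of 30% here is an absolute reduction of only 6 percentage points, so the two kinds of figures should not be compared directly.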

Abstract

Automatic speech recognition can potentially benefit from the lip motion patterns, complementing acoustic speech to improve the overall recognition performance, particularly in noise. In this paper we propose an audio-visual fusion strategy that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to enhanced representations which increase the recognition accuracy in both clean and noisy conditions. We test our strategy on the TCD-TIMIT and LRS2 datasets, designed for large vocabulary continuous speech recognition, applying three types of noise at different power ratios. We also exploit state of the art Sequence-to-Sequence architectures, showing that our method can be easily integrated. Results show relative improvements from 7% up to 30% on TCD-TIMIT over the acoustic modality alone, depending on the acoustic noise level. We anticipate…
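The abstract contrasts learned alignment with simple feature concatenation. As an illustration of the general idea only, the sketch below lets each audio frame attend over all video frames and then concatenates the attended video context onto the audio features. The scaled dot-product scoring, the shared feature size, the function names, and the random inputs are all assumptions made for this example; the abstract does not specify the scoring function or the network layers, and this is not the authors' implementation.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_fuse(audio: np.ndarray, video: np.ndarray):
    """Fuse two feature sequences with different frame rates.

    audio: (T_a, d) audio features, used as attention queries.
    video: (T_v, d) video features, used as keys and values.
    Returns the fused features (T_a, 2 * d) and the alignment weights (T_a, T_v).
    """
    d = audio.shape[-1]
    # Assumption: scaled dot-product similarity between every audio/video frame pair.
    scores = audio @ video.T / np.sqrt(d)
    # Each row is a soft alignment of one audio frame over all video frames.
    weights = softmax(scores, axis=-1)
    # Video context for each audio frame, weighted by that alignment.
    context = weights @ video
    # Concatenate per frame, now that both sequences share the audio time axis.
    fused = np.concatenate([audio, context], axis=-1)
    return fused, weights


# Hypothetical inputs: 50 audio frames and 12 video frames, 16 features each.
rng = np.random.default_rng(0)
audio = rng.standard_normal((50, 16))
video = rng.standard_normal((12, 16))
fused, weights = attention_fuse(audio, video)
print(fused.shape, weights.shape)  # prints (50, 32) (50, 12)
```

Because the weights are computed per audio frame, the two streams do not need matching frame rates or manual synchronisation, which is the practical difference from concatenating frame by frame.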
