TL;DR
This paper introduces an attention-based audio-visual fusion method for automatic speech recognition that learns to align lip motion with the audio signal, improving recognition accuracy in both clean and noisy conditions, with relative gains of 7% to 30% over audio-only recognition on TCD-TIMIT.
Contribution
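The paper presents an attention-based fusion strategy that goes beyond simple feature concatenation: it learns to align the audio and visual modalities automatically, yielding enhanced representations that raise recognition accuracy in both clean and noisy conditions.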
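One way to picture the alignment idea is scaled dot-product cross-modal attention, in which each audio frame queries the video frames and the attended visual context is combined with the audio feature. The sketch below is an illustration under assumptions, not the paper's architecture: the name `attend_and_fuse`, the dot-product scoring, the feature sizes, and the final concatenation are all hypothetical choices, and the learned projections and encoders a real model would use are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_and_fuse(audio, video):
    """Hypothetical fusion step (an assumption, not the paper's exact method).

    audio: (Ta, d) audio features, one row per audio frame
    video: (Tv, d) visual features, one row per video frame
    Returns fused features (Ta, 2d) and attention weights (Ta, Tv).
    """
    d = audio.shape[1]
    # Similarity of every audio frame to every video frame.
    scores = audio @ video.T / np.sqrt(d)
    # Each audio frame gets a soft alignment over the video frames.
    weights = softmax(scores, axis=1)
    # Visual context vector per audio frame.
    context = weights @ video
    # Combine each audio frame with its aligned visual context.
    fused = np.concatenate([audio, context], axis=1)
    return fused, weights

rng = np.random.default_rng(0)
audio = rng.standard_normal((100, 16))  # illustrative: 100 audio frames
video = rng.standard_normal((25, 16))   # illustrative: 25 video frames
fused, weights = attend_and_fuse(audio, video)
print(fused.shape, weights.shape)
```

Because the weights are computed per audio frame, the two streams do not need to share a frame rate, which naive frame-by-frame concatenation would require.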
Findings
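Achieves relative improvements from 7% up to 30% over audio-only recognition on TCD-TIMIT, depending on the acoustic noise level.
Improves accuracy in both clean and noisy conditions, tested with three types of noise at different power ratios on TCD-TIMIT and LRS2.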
Easily integrates with existing sequence-to-sequence architectures.
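For readers unfamiliar with the term, a relative improvement expresses the gain as a fraction of the baseline rather than as an absolute difference. The sketch below assumes the metric is an error rate (lower is better); the summary does not say which metric the percentages refer to, and the numbers used are hypothetical, not results from the paper.

```python
def relative_improvement(baseline_error, new_error):
    """Fraction of the baseline error removed by the new system."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error

# Hypothetical error rates chosen for illustration only.
audio_only_error = 0.40
audio_visual_error = 0.28
print(f"{relative_improvement(audio_only_error, audio_visual_error):.0%}")  # prints 30%
```

Under this definition, the same absolute gain counts as a larger relative improvement when the baseline error is small.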
Abstract
Automatic speech recognition can potentially benefit from the lip motion patterns, complementing acoustic speech to improve the overall recognition performance, particularly in noise. In this paper we propose an audio-visual fusion strategy that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to enhanced representations which increase the recognition accuracy in both clean and noisy conditions. We test our strategy on the TCD-TIMIT and LRS2 datasets, designed for large vocabulary continuous speech recognition, applying three types of noise at different power ratios. We also exploit state of the art Sequence-to-Sequence architectures, showing that our method can be easily integrated. Results show relative improvements from 7% up to 30% on TCD-TIMIT over the acoustic modality alone, depending on the acoustic noise level. We anticipate…
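The abstract mentions applying noise at different power ratios without describing the procedure. A common way to do this is to scale the noise so that the mixture reaches a target signal-to-noise ratio (SNR) in decibels. The sketch below assumes additive mixing and uses a synthetic tone and white noise as stand-ins; the function name `mix_at_snr` and all signal parameters are hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the result has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose scale so that speech_power / (scale**2 * noise_power)
    # equals 10 ** (snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech = np.sin(2 * np.pi * 220.0 * t)  # stand-in for a speech waveform
noise = rng.standard_normal(16000)      # stand-in for a noise recording
noisy = mix_at_snr(speech, noise, snr_db=10.0)

# Check the SNR actually achieved by the mixture.
achieved_db = 10 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))
print(round(float(achieved_db), 3))
```

Sweeping `snr_db` over several values would give one noisy copy of the data per noise level, which is the kind of setup the reported noise-dependent gains imply.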
