Audio Visual Speech Recognition using Deep Recurrent Neural Networks
Abhinav Thanda, Shankar M Venkatesan

TL;DR
This paper presents a deep recurrent neural network approach for audio-visual speech recognition, demonstrating improved accuracy through feature fusion and visual feature bottlenecking on the GRID corpus.
Contribution
Introduces a training algorithm for AV-ASR that uses deep RNNs with bottleneck visual features, and compares feature fusion with decision fusion.
Findings
The visual modality significantly improves character error rate (CER) under noisy conditions, even when the model is trained without noisy data.
Bottleneck visual features help the model converge during training.
Feature fusion outperforms decision fusion in accuracy.
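For readers unfamiliar with the metric, CER is conventionally computed as the character-level edit distance between the recognized text and the reference, divided by the reference length. The following is a minimal sketch of that standard definition, not the paper's evaluation script; the function names `edit_distance` and `cer` and the example strings are illustrative assumptions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, chosen only to illustrate the computation.
    print(cer("place red at g nine", "place red at g nine"))
    print(cer("abc", "abd"))
```

Because the distance is normalized by the reference length, CER can exceed 1.0 when the hypothesis contains many insertions.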
Abstract
In this work, we propose a training algorithm for an audio-visual automatic speech recognition (AV-ASR) system using a deep recurrent neural network (RNN). First, we train a deep RNN acoustic model with a Connectionist Temporal Classification (CTC) objective function. The frame labels obtained from the acoustic model are then used to perform a non-linear dimensionality reduction of the visual features using a deep bottleneck network. Audio and visual features are fused and used to train a fusion RNN. The use of bottleneck features for the visual modality helps the model to converge properly during training. Our system is evaluated on the GRID corpus. Our results show that the presence of the visual modality gives a significant improvement in character error rate (CER) at various levels of noise, even when the model is trained without noisy data. We also provide a comparison of two fusion methods: feature…
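The two fusion strategies named on this page can be illustrated with a small sketch. It assumes that audio and visual streams are already aligned frame by frame, that feature fusion means concatenating the per-frame vectors, and that decision fusion means a weighted log-linear combination of per-frame class posteriors from two separate models. These are common textbook formulations and may differ from the paper's exact method; the names `feature_fusion`, `decision_fusion`, and `audio_weight` are hypothetical.

```python
import math


def feature_fusion(audio_frames, visual_frames):
    """Concatenate per-frame audio and visual feature vectors (assumed aligned)."""
    if len(audio_frames) != len(visual_frames):
        raise ValueError("streams must have the same number of frames")
    return [list(a) + list(v) for a, v in zip(audio_frames, visual_frames)]


def decision_fusion(audio_post, visual_post, audio_weight=0.5):
    """Combine per-frame posteriors: w*log p_audio + (1-w)*log p_visual, renormalized."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if len(audio_post) != len(visual_post):
        raise ValueError("streams must have the same number of frames")
    floor = 1e-12  # avoids log(0); an illustrative choice
    fused = []
    for pa, pv in zip(audio_post, visual_post):
        scores = [audio_weight * math.log(max(a, floor))
                  + (1.0 - audio_weight) * math.log(max(v, floor))
                  for a, v in zip(pa, pv)]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        fused.append([e / total for e in exps])
    return fused


if __name__ == "__main__":
    # Toy inputs: one frame, two audio dims, one visual dim, three classes.
    print(feature_fusion([[1.0, 2.0]], [[3.0]]))
    print(decision_fusion([[0.7, 0.2, 0.1]], [[0.1, 0.8, 0.1]]))
```

In this sketch, feature fusion yields a single input sequence for one model, while decision fusion keeps two models separate and exposes `audio_weight` as a tunable trade-off, which one might lower as acoustic noise increases.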
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
