# An Empirical Analysis of Deep Audio-Visual Models for Speech Recognition

**Authors:** Devesh Walawalkar, Yihui He, Rohit Pillai

arXiv: 1812.09336 · 2018-12-27

## TL;DR

This paper empirically evaluates deep audio-visual speech recognition models that predict individual words from video frames and audio. It examines CNN-based architectures, an attention mechanism, a pre-trained residual-network backbone, and sensitivity to audio noise.

## Contribution

The authors re-implement a state-of-the-art model, derive variants of it, and run experiments on attention, a pre-trained residual backbone, and sensitivity to noisy audio input.

## Key findings

- Attention mechanisms improve model focus on relevant features.
- Pre-trained residual networks enhance recognition accuracy.
- Visual cues help the models stay robust when the audio input is noisy.
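
To make the noise-sensitivity finding concrete, the following is a minimal sketch of one common way to corrupt an audio signal at a controlled signal-to-noise ratio (SNR). This summary does not describe the paper's actual noise protocol, so the additive white Gaussian noise, the function names `add_noise_at_snr` and `measured_snr_db`, the 16 kHz sample rate, and the 440 Hz test tone are all assumptions chosen for illustration.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return a copy of `signal` with white Gaussian noise mixed in at `snr_db`.

    Hypothetical helper for illustration; not the paper's noise procedure.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Rescale the sampled noise so its empirical power matches the target.
    sampled_power = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(noise_power / sampled_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Compute the SNR in dB of `noisy` relative to `clean`."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Assumed test input: a one-second 440 Hz tone sampled at 16 kHz.
SAMPLE_RATE = 16000
clean = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
noisy = add_noise_at_snr(clean, snr_db=10.0)
```

Running a trained model on both `clean` and `noisy` versions of the same utterance, with and without the video stream, is one way to compare how much accuracy the visual input preserves as the SNR drops.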

## Abstract

In this project, we worked on speech recognition, specifically predicting individual words from both video frames and audio. Empowered by convolutional neural networks, recent speech recognition and lip-reading models achieve performance comparable to that of humans. We re-implemented the state-of-the-art model and derived variants of it. We then conducted extensive experiments on the effectiveness of an attention mechanism, the use of a more accurate residual network with pre-trained weights as the backbone, and the sensitivity of our model to audio input with and without noise.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1812.09336/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/1812.09336/full.md

## References

41 references — full list in the complete paper: https://tomesphere.com/paper/1812.09336/full.md

---
Source: https://tomesphere.com/paper/1812.09336