# Combining Residual Networks with LSTMs for Lipreading

**Authors:** Themos Stafylakis, Georgios Tzimiropoulos

arXiv: 1703.04105 · 2017-09-11

## TL;DR

This paper introduces an end-to-end deep learning model that combines spatiotemporal convolutional, residual, and bidirectional LSTM networks for word-level visual speech recognition. It reaches 83.0% word accuracy on the Lipreading In-The-Wild benchmark, the best reported result at the time of publication.

## Contribution

It presents a novel architecture that integrates spatiotemporal convolutional, residual, and bidirectional LSTM networks for lipreading, improving accuracy without using word boundary information during training or testing.

## Key findings

- Achieved 83.0% word accuracy on the Lipreading In-The-Wild benchmark
- Outperformed the previous state of the art by 6.8 percentage points (absolute)
- Demonstrated the effectiveness of a combined residual and LSTM architecture
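
The 6.8-point gain is an absolute difference in accuracy, not a relative one. The short calculation below works out what the two reported figures imply. The previous accuracy and the relative error reduction are derived here from the 83.0 and 6.8 values and are not stated in this summary.

```python
# Figures quoted in the summary above.
accuracy = 83.0        # word accuracy of the proposed network, in percent
absolute_gain = 6.8    # improvement over the previous best, in percentage points

# Implied accuracy of the previous state of the art (derived, not quoted).
previous_accuracy = accuracy - absolute_gain

# Word error rates before and after, and the relative reduction in errors.
previous_error = 100.0 - previous_accuracy
current_error = 100.0 - accuracy
relative_error_reduction = (previous_error - current_error) / previous_error

print(f"previous accuracy: {previous_accuracy:.1f}%")
print(f"error rate: {previous_error:.1f}% -> {current_error:.1f}%")
print(f"relative error reduction: {relative_error_reduction:.1%}")
```

Under these figures the previous best sits at about 76.2% accuracy, so the new model removes roughly 28.6% of the remaining word errors.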

## Abstract

We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database of 500-size target-words consisting of 1.28sec video excerpts from BBC TV broadcasts. The proposed network attains word accuracy equal to 83.0, yielding 6.8 absolute improvement over the current state-of-the-art, without using information about word boundaries during training or testing.
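
To make the data flow through the three stages concrete, the following is a minimal shape-tracing sketch in plain Python, not the authors' implementation. Only the 500-word vocabulary comes from the abstract. The input size, kernel sizes, strides, feature widths, and the time-pooling step are hypothetical values chosen for illustration, and `trace_shapes` and `conv_out` are names introduced here.

```python
from dataclasses import dataclass


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


@dataclass
class Shapes:
    frontend: tuple   # (channels, frames, height, width) after the 3D convolution
    per_frame: tuple  # (frames, features) after the per-frame residual network
    blstm: tuple      # (frames, 2 * hidden) after the bidirectional LSTM
    logits: tuple     # (num_words,) after pooling per-frame scores over time


def trace_shapes(frames, height, width, *,
                 frontend_channels=64,   # assumption
                 temporal_kernel=5,      # assumption
                 spatial_kernel=7,       # assumption
                 spatial_stride=2,       # assumption
                 resnet_features=256,    # assumption
                 lstm_hidden=256,        # assumption
                 num_words=500):         # from the abstract
    # Spatiotemporal convolution: padding of kernel // 2 with stride 1 in time
    # keeps one output step per input frame; space is downsampled by the stride.
    t = conv_out(frames, temporal_kernel, 1, temporal_kernel // 2)
    h = conv_out(height, spatial_kernel, spatial_stride, spatial_kernel // 2)
    w = conv_out(width, spatial_kernel, spatial_stride, spatial_kernel // 2)
    frontend = (frontend_channels, t, h, w)

    # Residual network applied to every time step independently,
    # reducing each feature map to a single feature vector.
    per_frame = (t, resnet_features)

    # Bidirectional LSTM: forward and backward states are concatenated.
    blstm = (t, 2 * lstm_hidden)

    # Per-step word scores pooled over time into one prediction per clip
    # (the pooling choice is an assumption of this sketch).
    logits = (num_words,)
    return Shapes(frontend, per_frame, blstm, logits)


# Hypothetical input: 32 grayscale frames of 112x112 pixels.
shapes = trace_shapes(32, 112, 112)
print(shapes)
```

The point of the sketch is that the time axis survives every stage up to the final pooling, so the recurrent layers see one feature vector per frame of the whole clip rather than a pre-segmented word.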

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1703.04105/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/1703.04105/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/1703.04105/full.md

---
Source: https://tomesphere.com/paper/1703.04105