Should we hard-code the recurrence concept or learn it instead? Exploring the Transformer architecture for Audio-Visual Speech Recognition

George Sterpu; Christian Saam; Naomi Harte

arXiv:2005.09297·eess.AS·May 20, 2020

TL;DR

This paper compares LSTM and Transformer architectures for audio-visual speech recognition. Transformers also learn cross-modal monotonic alignments, but they suffer from the same visual convergence problems as LSTMs, which calls for a deeper investigation into the dominant modality problem.

Contribution

The study introduces a Transformer-based variant of AV Align and provides a detailed comparison with LSTM, revealing insights into their strengths, weaknesses, and convergence challenges.

Findings

01

Transformers learn cross-modal monotonic alignments.

02

Both models face convergence issues with the visual modality.

03

Transformers show comparable performance to LSTMs in AVSR.

Abstract

The audio-visual speech fusion strategy AV Align has shown significant performance improvements in audio-visual speech recognition (AVSR) on the challenging LRS2 dataset. Performance improvements range between 7% and 30% depending on the noise level when leveraging the visual modality of speech in addition to the auditory one. This work presents a variant of AV Align where the recurrent Long Short-term Memory (LSTM) computation block is replaced by the more recently proposed Transformer block. We compare the two methods, discussing in greater detail their strengths and weaknesses. We find that Transformers also learn cross-modal monotonic alignments, but suffer from the same visual convergence problems as the LSTM model, calling for a deeper investigation into the dominant modality problem in machine learning.
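
To make the notion of a cross-modal monotonic alignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes that audio frames act as queries over video frames in a single-head scaled dot-product attention with no learned projections; the function names `cross_modal_attention` and `is_monotonic` and the toy inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(audio, video):
    """Each audio frame (query) attends over all video frames (keys/values).

    audio: list of T_a vectors, video: list of T_v vectors, same dimension d.
    Returns (context, weights), where weights[t] sums to 1 over video frames.
    """
    d = len(audio[0])
    scale = 1.0 / math.sqrt(d)
    context, weights = [], []
    for q in audio:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in video]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the video frames gives the visual context vector.
        context.append([sum(wj * v[i] for wj, v in zip(w, video)) for i in range(d)])
    return context, weights


def is_monotonic(weights):
    """True if the most-attended video frame never moves backwards in time."""
    peaks = [max(range(len(w)), key=w.__getitem__) for w in weights]
    return all(a <= b for a, b in zip(peaks, peaks[1:]))


# Toy data (assumption): 3 video frames, 6 audio frames at twice the frame rate.
video = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
audio = [video[i // 2] for i in range(6)]

context, weights = cross_modal_attention(audio, video)
print(is_monotonic(weights))
```

On this toy input each pair of consecutive audio frames attends most strongly to the same video frame, so the attention peaks advance in step with time. A trained model would learn projections for queries, keys, and values rather than compare raw features.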

Code & Models

Repositories

georgesterpu/Taris
tf

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Sigmoid Activation · Dropout