TL;DR
This paper compares LSTM and Transformer architectures for audio-visual speech recognition. Transformers also learn cross-modal alignments, but they suffer from the same visual convergence problems as LSTMs, which calls for further research into the dominant modality problem.
Contribution
The study introduces a variant of AV Align in which the LSTM block is replaced by a Transformer block, and compares the two models in detail, covering their strengths, weaknesses, and convergence challenges.
Findings
Transformers learn cross-modal monotonic alignments.
Both models face convergence issues with the visual modality.
Transformers show comparable performance to LSTMs in AVSR.
Abstract
The audio-visual speech fusion strategy AV Align has shown significant performance improvements in audio-visual speech recognition (AVSR) on the challenging LRS2 dataset. Performance improvements range between 7% and 30% depending on the noise level when leveraging the visual modality of speech in addition to the auditory one. This work presents a variant of AV Align where the recurrent Long Short-term Memory (LSTM) computation block is replaced by the more recently proposed Transformer block. We compare the two methods, discussing in greater detail their strengths and weaknesses. We find that Transformers also learn cross-modal monotonic alignments, but suffer from the same visual convergence problems as the LSTM model, calling for a deeper investigation into the dominant modality problem in machine learning.
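To make "cross-modal monotonic alignment" concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes that audio frames act as queries over video frames, and it uses a single attention head with no learned projections. All function names and the toy inputs are hypothetical.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(audio, video):
    """Scaled dot-product attention with audio frames as queries and
    video frames as keys and values (an assumption for illustration).

    audio: list of T_a vectors, video: list of T_v vectors, same size d.
    Returns (contexts, weights); weights[t] is a distribution over video frames.
    """
    d = len(audio[0])
    scale = math.sqrt(d)
    contexts, weights = [], []
    for query in audio:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in video]
        w = softmax(scores)
        context = [sum(w[j] * video[j][i] for j in range(len(video))) for i in range(d)]
        contexts.append(context)
        weights.append(w)
    return contexts, weights


def alignment_path(weights):
    # Index of the most attended video frame for each audio frame.
    return [max(range(len(w)), key=w.__getitem__) for w in weights]


def is_monotonic(path):
    # True when the attended video index never moves backwards in time.
    return all(a <= b for a, b in zip(path, path[1:]))


# Toy input: four audio frames and two video frames (video has the lower frame rate).
audio = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
video = [[1.0, 0.0], [0.0, 1.0]]

contexts, weights = cross_modal_attention(audio, video)
path = alignment_path(weights)
print(path, is_monotonic(path))
```

In this toy case each audio frame attends most strongly to the video frame it matches, so the attended index never decreases over time. Checking real attention maps for the same property is one simple way to test whether a trained model has learned a monotonic alignment.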
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Sigmoid Activation · Dropout
