Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading
Chunlin Tian, Weijun Ji

TL;DR
This paper introduces an Auxiliary Multimodal LSTM model for audio-visual speech recognition, which effectively fuses audio and video data in an end-to-end manner, improving robustness and accuracy over traditional and previous deep learning models.
Contribution
The paper presents a novel end-to-end multimodal LSTM architecture that balances modal and temporal fusion, simplifies training, and enhances performance in AVSR tasks.
Findings
am-LSTM outperforms traditional methods on three datasets.
The model is easier to train and less prone to overfitting.
It effectively combines audio and visual information for speech recognition.
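As a rough illustration of what combining audio and visual information inside a recurrent model can look like, the sketch below concatenates per-frame audio and visual feature vectors and passes them through a standard LSTM cell built from sigmoid and tanh gates. It is not the am-LSTM architecture from the paper, whose details are not given on this page. The function names (`lstm_step`, `fuse_and_encode`, `init_params`), the feature sizes, the gate ordering, and the concatenation-based fusion are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid used for the LSTM gates."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, W, b):
    """One step of a standard LSTM cell (generic, not the paper's am-LSTM).

    x: input vector, h: previous hidden state, c: previous cell state.
    W: 4*H rows of length len(x) + len(h); b: 4*H biases.
    Gate order in W and b is an assumption here: input, forget, output, candidate.
    """
    H = len(h)
    z = list(x) + list(h)  # input and previous hidden state, concatenated
    pre = [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]
    i = [sigmoid(v) for v in pre[0:H]]            # input gate
    f = [sigmoid(v) for v in pre[H:2 * H]]        # forget gate
    o = [sigmoid(v) for v in pre[2 * H:3 * H]]    # output gate
    g = [math.tanh(v) for v in pre[3 * H:4 * H]]  # candidate cell values
    c_new = [f[k] * c[k] + i[k] * g[k] for k in range(H)]
    h_new = [o[k] * math.tanh(c_new[k]) for k in range(H)]
    return h_new, c_new


def fuse_and_encode(audio_seq, video_seq, W, b, hidden_size):
    """Fuse modalities per frame by concatenation, then fuse over time with the LSTM."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for a_t, v_t in zip(audio_seq, video_seq):
        x_t = list(a_t) + list(v_t)  # modal fusion (assumed: simple concatenation)
        h, c = lstm_step(x_t, h, c, W, b)  # temporal fusion
    return h


def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases (hypothetical initialization)."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
         for _ in range(4 * hidden_size)]
    b = [0.0] * (4 * hidden_size)
    return W, b


# Toy usage: 4 frames, 3 audio features and 2 visual features per frame.
audio = [[0.1, 0.2, 0.3]] * 4
video = [[0.5, -0.5]] * 4
W, b = init_params(input_size=5, hidden_size=3)
summary = fuse_and_encode(audio, video, W, b, hidden_size=3)
```

The final hidden state `summary` is a fixed-length vector that a classifier layer could consume. A real system would learn `W` and `b` by backpropagation and would typically use a deep learning framework rather than plain lists.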
Abstract
Audio-visual Speech Recognition (AVSR), which uses both video and audio information to perform Automatic Speech Recognition (ASR), is an application of multimodal learning that makes ASR systems more robust and accurate. Traditional models usually treated AVSR as inference or projection, but their strict priors limit their ability. With the revival of deep learning, Deep Neural Networks (DNNs) have become an important tool in many traditional classification tasks, including ASR, image classification, and natural language processing. Some DNN models have been applied to AVSR, such as Multimodal Deep Autoencoders (MDAEs), Multimodal Deep Belief Networks (MDBNs), and Multimodal Deep Boltzmann Machines (MDBMs), and they work better than traditional methods. However, such DNN models have several shortcomings: (1) they do not balance modal fusion and temporal fusion, or lack temporal fusion altogether; (2) The…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Sigmoid Activation · Tanh Activation · Deep Belief Network · Long Short-Term Memory
