Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading

Chunlin Tian; Weijun Ji

arXiv:1701.04224 · cs.CV · March 20, 2017 · 5 cites

Open Access

TL;DR

This paper introduces an Auxiliary Multimodal LSTM model for audio-visual speech recognition, which effectively fuses audio and video data in an end-to-end manner, improving robustness and accuracy over traditional and previous deep learning models.

Contribution

The paper presents a novel end-to-end multimodal LSTM architecture that balances modal and temporal fusion, simplifies training, and enhances performance in AVSR tasks.

Findings

01. am-LSTM outperforms traditional methods on three datasets.

02. The model is easier to train and less prone to overfitting.

03. It effectively combines audio and visual information for speech recognition.

Abstract

Audio-visual Speech Recognition (AVSR), which uses both video and audio information to perform Automatic Speech Recognition (ASR), is an application of multimodal learning that makes ASR systems more robust and accurate. Traditional models usually treated AVSR as inference or projection, but strict priors limit their ability. With the revival of deep learning, Deep Neural Networks (DNNs) have become an important tool in many traditional classification tasks, including ASR, image classification, and natural language processing. Some DNN models have been used in AVSR, such as Multimodal Deep Autoencoders (MDAEs), Multimodal Deep Belief Networks (MDBNs), and Multimodal Deep Boltzmann Machines (MDBMs), and they work better than traditional methods. However, such DNN models have several shortcomings: (1) they do not balance modal fusion and temporal fusion, or even lack temporal fusion altogether; (2) The…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Sigmoid Activation · Tanh Activation · Deep Belief Network · Long Short-Term Memory