Audio-Visual Decision Fusion for WFST-based and seq2seq Models

Rohith Aralikatti; Sharad Roy; Abhinav Thanda; Dilip Kumar Margam; Pujitha Appan Kandala; Tanay Sharma; Shankar M Venkatesan

arXiv:2001.10832·eess.AS·January 30, 2020·1 cites

Open Access

TL;DR

This paper introduces audio-visual fusion methods for speech recognition in which the acoustic and visual models are trained independently and combined at inference time, improving accuracy under noisy conditions.

Contribution

The paper proposes inference-time fusion techniques for WFST-based and seq2seq models, so that the acoustic and visual models can be trained independently.

Findings

01 Significant WER reduction over acoustic-only systems at various SNRs

02 Effective seq2seq fusion without a weighting parameter

03 Improved robustness in noisy speech recognition scenarios

Abstract

Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER). In such cases, information from the visual modality, comprising the speaker's lip movements, can help improve performance. In this work, we propose novel methods to fuse information from audio and visual modalities at inference time. This enables us to train the acoustic and visual models independently. First, we train separate RNN-HMM based acoustic and visual models. A common WFST generated by taking a special union of the HMM components is used for decoding using a modified Viterbi algorithm. Second, we train separate seq2seq acoustic and visual models. The decoding step is performed simultaneously for both modalities using shallow fusion while maintaining a common hypothesis beam. We also present results for a novel seq2seq fusion without the weighting parameter. We present results at varying…
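
As a rough illustration of the seq2seq decoding described in the abstract, the following Python sketch runs a single beam search in which every hypothesis is scored by both models at each step. It is a minimal sketch under stated assumptions, not the authors' implementation: the step-function interface, `fused_beam_search`, `make_toy_model`, the weight `lam`, and the scoring rule `log p_audio + lam * log p_visual` are all hypothetical choices, since the abstract does not give the paper's exact scoring formula.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical interface (an assumption, not from the paper): a step function
# maps a token prefix to log-probabilities over the next token.
Prefix = Tuple[str, ...]
StepFn = Callable[[Prefix], Dict[str, float]]

EOS = "</s>"


def fused_beam_search(
    audio_step: StepFn,
    visual_step: StepFn,
    lam: float = 0.5,
    beam_size: int = 3,
    max_len: int = 10,
) -> List[Tuple[Prefix, float]]:
    """Beam search over one shared beam, scoring each step with both models.

    Assumed fusion rule: log p_audio + lam * log p_visual.
    """
    beam: List[Tuple[Prefix, float]] = [((), 0.0)]
    finished: List[Tuple[Prefix, float]] = []
    for _ in range(max_len):
        candidates: List[Tuple[Prefix, float]] = []
        for prefix, score in beam:
            audio_logp = audio_step(prefix)
            visual_logp = visual_step(prefix)
            for tok, a in audio_logp.items():
                if tok not in visual_logp:
                    continue  # only score tokens that both models can emit
                fused = a + lam * visual_logp[tok]
                candidates.append((prefix + (tok,), score + fused))
        # Keep the best hypotheses in one common beam for both modalities.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == EOS:
                finished.append((prefix, score))
            else:
                beam.append((prefix, score))
        if not beam:
            break
    finished.extend(beam)  # fall back to unfinished hypotheses, if any remain
    return sorted(finished, key=lambda c: c[1], reverse=True)


def make_toy_model(first_token_probs: Dict[str, float]) -> StepFn:
    """Toy stand-in for a trained model: one content token, then EOS."""
    def step(prefix: Prefix) -> Dict[str, float]:
        if not prefix:
            return {t: math.log(p) for t, p in first_token_probs.items()}
        return {EOS: 0.0}
    return step


# The noisy acoustic model slightly prefers "b"; the visual model prefers "a".
audio = make_toy_model({"a": 0.4, "b": 0.6})
visual = make_toy_model({"a": 0.9, "b": 0.1})
best_fused = fused_beam_search(audio, visual, lam=1.0)[0][0]
best_audio_only = fused_beam_search(audio, visual, lam=0.0)[0][0]
```

In this toy setup, a larger `lam` lets the visual model override an ambiguous acoustic model, while setting `lam` to 0 recovers acoustic-only decoding.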

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence