Audio-Visual Decision Fusion for WFST-based and seq2seq Models
Rohith Aralikatti, Sharad Roy, Abhinav Thanda, Dilip Kumar Margam, Pujitha Appan Kandala, Tanay Sharma, Shankar M Venkatesan

TL;DR
This paper introduces novel audio-visual fusion methods for speech recognition that improve accuracy under noisy conditions by independently training models and combining their outputs at inference time.
Contribution
It proposes new fusion techniques for WFST-based and seq2seq models that enable independent training and effective inference-time combination of audio and visual data.
Findings
Significant WER reduction over acoustic-only systems at various SNRs
Effective fusion without a weighting parameter in seq2seq models
Improved robustness in noisy speech recognition scenarios
Abstract
Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER). In such cases, information from the visual modality, comprising the speaker's lip movements, can help improve performance. In this work, we propose novel methods to fuse information from audio and visual modalities at inference time. This enables us to train the acoustic and visual models independently. First, we train separate RNN-HMM based acoustic and visual models. A common WFST generated by taking a special union of the HMM components is used for decoding using a modified Viterbi algorithm. Second, we train separate seq2seq acoustic and visual models. The decoding step is performed simultaneously for both modalities using shallow fusion while maintaining a common hypothesis beam. We also present results for a novel seq2seq fusion without the weighting parameter. We present results at varying…
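The seq2seq decoding step described in the abstract (shallow fusion over a common hypothesis beam) can be illustrated with a minimal sketch. This is a generic shallow-fusion formulation, not necessarily the authors' exact scoring rule: each hypothesis in a single shared beam is extended by adding the audio model's log-probability and a weighted visual model's log-probability for the next token. The names `StepFn`, `table_model`, `fused_beam_search`, the weight `lam`, and all toy probabilities are assumptions made for illustration only.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"
# Hypothetical interface: a model maps a token prefix to next-token log-probs.
StepFn = Callable[[Tuple[str, ...]], Dict[str, float]]


def table_model(first_step_probs: Dict[str, float]) -> StepFn:
    """Toy stand-in for a trained seq2seq decoder (assumption, not a real model)."""
    later_probs = {"a": 0.05, "b": 0.05, EOS: 0.9}

    def step(prefix: Tuple[str, ...]) -> Dict[str, float]:
        probs = first_step_probs if not prefix else later_probs
        return {tok: math.log(p) for tok, p in probs.items()}

    return step


def fused_beam_search(
    audio_step: StepFn,
    visual_step: StepFn,
    lam: float = 1.0,
    beam_size: int = 3,
    max_len: int = 10,
) -> List[Tuple[Tuple[str, ...], float]]:
    """Beam search where both models score every extension of one shared beam."""
    beam: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    finished: List[Tuple[Tuple[str, ...], float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            audio_lp = audio_step(prefix)
            visual_lp = visual_step(prefix)
            for tok, a_lp in audio_lp.items():
                # Shallow fusion: audio log-prob plus weighted visual log-prob.
                fused = a_lp + lam * visual_lp[tok]
                candidates.append((prefix + (tok,), score + fused))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for hyp, s in candidates[:beam_size]:
            (finished if hyp[-1] == EOS else beam).append((hyp, s))
        if not beam:
            break
    finished.extend(beam)  # keep hypotheses cut off at max_len
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


# Toy scenario: noisy audio slightly prefers "a"; the visual model prefers "b".
audio = table_model({"a": 0.55, "b": 0.40, EOS: 0.05})
visual = table_model({"a": 0.10, "b": 0.85, EOS: 0.05})

audio_only_best = fused_beam_search(audio, visual, lam=0.0)[0][0]
fused_best = fused_beam_search(audio, visual, lam=1.0)[0][0]
print(audio_only_best, fused_best)
```

In this toy setup, setting `lam=0.0` reduces the search to audio-only decoding, while `lam=1.0` lets the visual model overturn the audio model's first-token choice. Because fusion happens only at decoding time, the two models can be trained independently, which is the property the abstract emphasizes.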
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
