Combining Multiple Views for Visual Speech Recognition
Marina Zimmermann, Mostafa Mehdipour Ghazi, Hazım Kemal Ekenel, Jean-Philippe Thiran

TL;DR
This paper investigates combining multiple camera views to enhance visual speech recognition, demonstrating significant performance improvements through various fusion strategies and view combinations.
Contribution
It provides a comprehensive analysis of multi-view fusion at feature and decision levels for visual speech recognition, highlighting the benefits of combining different camera angles.
Findings
Multi-view fusion improves recognition accuracy.
Combining views increases sentence correctness from 76% to 83%.
Fusion at decision level yields significant performance gains.
Abstract
Visual speech recognition is a challenging research problem with a particular practical application of aiding audio speech recognition in noisy scenarios. Multiple camera setups can be beneficial for the visual speech recognition systems in terms of improved performance and robustness. In this paper, we explore this aspect and provide a comprehensive study on combining multiple views for visual speech recognition. The thorough analysis covers fusion of all possible view angle combinations both at feature level and decision level. The employed visual speech recognition system in this study extracts features through a PCA-based convolutional neural network, followed by an LSTM network. Finally, these features are processed in a tandem system, being fed into a GMM-HMM scheme. The decision fusion acts after this point by combining the Viterbi path log-likelihoods. The results show that the…
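The two fusion levels named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes feature-level fusion is a simple concatenation of per-view feature vectors, and that decision-level fusion is a (optionally weighted) sum of per-view Viterbi path log-likelihoods followed by an argmax. The function names, the weights, the candidate sentences, and the scores are all hypothetical.

```python
from typing import Dict, List, Optional


def fuse_features(view_features: List[List[float]]) -> List[float]:
    """Feature-level fusion (assumed): concatenate per-view feature vectors."""
    fused: List[float] = []
    for features in view_features:
        fused.extend(features)
    return fused


def fuse_decisions(
    view_scores: List[Dict[str, float]],
    weights: Optional[List[float]] = None,
) -> str:
    """Decision-level fusion (assumed): weighted sum of log-likelihoods, then argmax.

    Each dict maps a candidate hypothesis to the log-likelihood that one
    view's recognizer assigned to it.
    """
    if weights is None:
        weights = [1.0] * len(view_scores)
    # Only hypotheses scored by every view can be compared fairly.
    candidates = set(view_scores[0])
    for scores in view_scores[1:]:
        candidates &= set(scores)
    combined = {
        c: sum(w * scores[c] for w, scores in zip(weights, view_scores))
        for c in candidates
    }
    return max(combined, key=combined.get)


# Hypothetical log-likelihoods from two camera views for two candidate sentences.
frontal = {"bin blue": -120.0, "bin green": -118.5}
profile = {"bin blue": -110.0, "bin green": -115.0}

# The frontal view alone prefers "bin green"; the summed scores prefer "bin blue".
best = fuse_decisions([frontal, profile])
```

Summing log-likelihoods treats the views as independent sources of evidence; the weights are one hypothetical way to down-weight a less reliable view.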
Taxonomy
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
