Multimodal Integration for Large-Vocabulary Audio-Visual Speech Recognition
Wentao Yu, Steffen Zeiler, Dorothea Kolossa

TL;DR
This paper investigates various multimodal integration strategies for large-vocabulary audio-visual speech recognition, emphasizing dynamic stream reliability indicators to improve robustness and accuracy in challenging conditions.
Contribution
It compares multiple integration methods for large-vocabulary speech recognition and highlights the effectiveness of dynamic stream reliability indicators in enhancing performance.
Findings
Dynamic stream reliability indicators significantly improve recognition accuracy.
Hybrid architectures benefit from visual information during audio distortions.
Early and end-to-end integration strategies are evaluated for large-vocabulary tasks.
Abstract
For many small- and medium-vocabulary tasks, audio-visual speech recognition can significantly improve the recognition rates compared to audio-only systems. However, there is still an ongoing debate regarding the best combination strategy for multi-modal information, which should allow for the translation of these gains to large-vocabulary recognition. While an integration at the level of state-posterior probabilities, using dynamic stream weighting, is almost universally helpful for small-vocabulary systems, in large-vocabulary speech recognition, the recognition accuracy remains difficult to improve. In the following, we specifically consider the large-vocabulary task of the LRS2 database, and we investigate a broad range of integration strategies, comparing early integration and end-to-end learning with many versions of hybrid recognition and dynamic stream weighting. One aspect,…
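As an illustration of what integration at the level of state-posterior probabilities with dynamic stream weighting can look like, here is a minimal Python sketch. It assumes the commonly used log-linear form, in which each frame's audio and video log-posteriors are combined with a per-frame weight and then renormalised; the excerpt above does not state the exact formulation used in the paper. The function name `fuse_log_posteriors` and its parameters are hypothetical, and the per-frame weights are assumed to come from some reliability estimator that is not shown.

```python
import math


def fuse_log_posteriors(log_post_audio, log_post_video, audio_weights):
    """Frame-wise log-linear fusion of two streams' state posteriors.

    Assumed form (not taken from the paper):
        log p(s | o_t) ~ w_t * log p(s | audio_t) + (1 - w_t) * log p(s | video_t)

    log_post_audio, log_post_video: lists of frames; each frame is a list of
        log-posteriors over the same set of states.
    audio_weights: one weight in [0, 1] per frame (hypothetical output of a
        stream-reliability estimator).
    Returns a list of frames of renormalised fused log-posteriors.
    """
    fused = []
    for frame_a, frame_v, w in zip(log_post_audio, log_post_video, audio_weights):
        if not 0.0 <= w <= 1.0:
            raise ValueError("stream weight must lie in [0, 1]")
        # Weighted sum in the log domain (a product of powers in the linear domain).
        scores = [w * a + (1.0 - w) * v for a, v in zip(frame_a, frame_v)]
        # Renormalise with a numerically stable log-sum-exp so each frame
        # is again a valid distribution over states.
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        fused.append([s - log_z for s in scores])
    return fused


# Toy usage: two frames, two states. The first frame trusts audio fully,
# the second trusts video fully (e.g. during a hypothetical audio distortion).
audio = [[math.log(0.7), math.log(0.3)], [math.log(0.6), math.log(0.4)]]
video = [[math.log(0.2), math.log(0.8)], [math.log(0.1), math.log(0.9)]]
fused = fuse_log_posteriors(audio, video, audio_weights=[1.0, 0.0])
```

With a weight of 1.0 the fused frame reduces to the audio posteriors, and with 0.0 to the video posteriors. Intermediate weights interpolate between the two streams in the log domain, which is what lets a time-varying weight shift trust toward the video stream when the audio is degraded.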
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
