Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception
Jiadong Wang, Xinyuan Qian, Haizhou Li

TL;DR
This paper introduces a Predict-and-Update Network that mimics human visual cueing to improve audio-visual speech recognition, significantly reducing word error rates, especially in noisy environments.
Contribution
It proposes a novel visual cueing mechanism with a cross-modal Conformer, enhancing AVSR performance beyond existing methods.
Findings
Outperforms state-of-the-art AVSR methods on LRS2-BBC and LRS3-BBC datasets.
Reduces Word Error Rate (WER) by over 10% in clean conditions and over 40% in noisy conditions, measured as relative reductions.
Validates the effectiveness of visual cueing in multi-modal speech recognition.
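The WER figures above depend on how the metric and a relative reduction are computed. The following is a minimal sketch of both, assuming the standard word-level edit-distance definition of WER. The function names and the example numbers are illustrative assumptions and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts, chosen only to illustrate the metric.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"WER: {wer:.3f}")

    # Hypothetical baseline and improved WERs, not results from the paper.
    print(f"Relative reduction: {relative_wer_reduction(0.10, 0.06):.0%}")
```

Under this definition, a drop from a hypothetical 10% WER to 6% WER is a 40% relative reduction, even though the absolute change is only 4 percentage points.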
Abstract
Audio and visual signals complement each other in human speech perception, and they do so in speech recognition as well. As far as speech perception is concerned, the visual hint is less evident than the acoustic hint but more robust in a complex acoustic environment. How to effectively exploit the interaction between audio and visual signals for automatic speech recognition remains a challenge. Previous studies have exploited visual signals as redundant or complementary information to the audio input in a synchronous manner. Human studies suggest that the visual signal primes the listener in advance as to when and on which frequency to attend. We propose a Predict-and-Update Network (P&U net) to simulate such a visual cueing mechanism for Audio-Visual Speech Recognition (AVSR). In particular, we first predict the character posteriors of the spoken words, i.e. the visual embedding, based on the…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
