Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception

Jiadong Wang; Xinyuan Qian; Haizhou Li

arXiv:2209.01768 · cs.MM · September 7, 2022 · 6 citations

TL;DR

This paper introduces a Predict-and-Update Network that mimics human visual cueing to improve audio-visual speech recognition, significantly reducing word error rates, especially in noisy environments.

Contribution

The paper proposes a novel visual cueing mechanism with a cross-modal Conformer, which improves AVSR performance beyond existing methods.

Findings

01. Outperforms state-of-the-art AVSR methods on the LRS2-BBC and LRS3-BBC datasets.

02. Reduces relative word error rate (WER) by more than 10% under clean conditions and by more than 40% under noisy conditions.

03. Validates the effectiveness of visual cueing in multi-modal speech recognition.
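
WER improvements like those listed above are usually read as relative reductions against a baseline system. The sketch below assumes the standard definition of WER (word-level edit distance divided by the number of reference words). The function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and the numbers in the usage lines are made up for illustration; they are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only (not taken from the paper).
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
gain = relative_wer_reduction(baseline_wer=0.10, new_wer=0.06)
```

Under this definition, a drop from 10% to 6% absolute WER is a 40% relative reduction, which is why relative figures can look large even when the absolute change is a few points.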

Abstract

Audio and visual signals complement each other in human speech perception, and they do so in speech recognition as well. As far as speech perception is concerned, the visual hint is less evident than the acoustic hint but more robust in a complex acoustic environment. How to effectively exploit the interaction between audio and visual signals for automatic speech recognition remains a challenge. Previous studies have exploited visual signals as redundant or complementary information to the audio input in a synchronous manner. Human studies suggest that the visual signal primes the listener in advance as to when and at which frequency to attend. We propose a Predict-and-Update Network (P&U net) to simulate such a visual cueing mechanism for Audio-Visual Speech Recognition (AVSR). In particular, we first predict the character posteriors of the spoken words, i.e. the visual embedding, based on the…
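
The predict-then-update idea can be illustrated with a toy sketch. This is not the authors' architecture: the abstract is truncated before the update step is described, so everything below is an assumption. A single cross-attention step stands in for the cross-modal Conformer, the log-linear fusion is an invented choice, and all names (`predict`, `update`, `W_v`, `W_q`, `W_o`) and weights are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def predict(visual_frames, W_v):
    """Predict step: per-frame character posteriors from visual features only."""
    return [softmax(matvec(W_v, frame)) for frame in visual_frames]


def update(audio_frames, visual_posteriors, W_q, W_o):
    """Update step (assumed): each audio frame attends over the visual
    posteriors, and the attended cue is fused with the audio evidence."""
    num_chars = len(visual_posteriors[0])
    updated = []
    for frame in audio_frames:
        query = matvec(W_q, frame)  # audio query in character space
        scale = math.sqrt(num_chars)
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in visual_posteriors]
        weights = softmax(scores)   # attention over visual frames
        cue = [sum(w * key[c] for w, key in zip(weights, visual_posteriors))
               for c in range(num_chars)]
        audio_logits = matvec(W_o, frame)
        # Assumed log-linear fusion: audio logits plus log of the visual cue.
        fused = [l + math.log(p + 1e-9) for l, p in zip(audio_logits, cue)]
        updated.append(softmax(fused))
    return updated


# Toy data: 2 visual frames, 3 audio frames, 3 character classes.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.5], [0.2, 0.1], [0.0, 1.0]]
W_v = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
W_q = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_o = [[0.5, 0.0], [0.0, 0.5], [0.1, 0.1]]

visual_posteriors = predict(visual, W_v)
updated_posteriors = update(audio, visual_posteriors, W_q, W_o)
```

The point of the sketch is the ordering: the visual stream produces a first guess at the character distribution, and the audio stream then revises that guess rather than being fused with it symmetrically.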

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation