Human-Inspired Computing for Robust and Efficient Audio-Visual Speech Recognition
Qianhui Liu, Jiadong Wang, Yang Wang, Xin Yang, Gang Pan, Haizhou Li

TL;DR
This paper introduces HI-AVSNN, a human-inspired spiking neural network for audio-visual speech recognition that improves accuracy and real-time processing by mimicking human perception mechanisms.
Contribution
The paper proposes a novel SNN model designed for robust AVSR. It combines three characteristics (cueing interaction, causal processing, and spike activity) and draws on visual cues and event-based visual data.
Findings
Outperforms existing SNN-based AVSR methods by 2.27% in accuracy.
Effectively integrates visual and auditory cues for improved speech recognition.
Demonstrates real-time applicability with causal, spike-based processing.
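To make "causal, spike-based processing" concrete, here is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron, a standard building block of spiking networks. It is not the HI-AVSNN architecture: the function name `lif_causal` and the `decay` and `threshold` values are assumptions chosen for illustration. The point it shows is that each output spike depends only on the current and past inputs, never on future ones.

```python
def lif_causal(inputs, decay=0.5, threshold=1.0):
    """Run one leaky integrate-and-fire neuron over a sequence, step by step.

    Illustrative only: decay and threshold are assumed values,
    not parameters taken from the paper.
    """
    v = 0.0          # membrane potential carried across time steps
    spikes = []
    for x in inputs:
        v = decay * v + x        # leak the old potential, add the new input
        if v >= threshold:
            spikes.append(1)     # emit a binary spike
            v = 0.0              # hard reset after firing
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    stream = [0.6, 0.6, 0.0, 1.2]
    print(lif_causal(stream))
```

Because the loop only reads inputs up to the current step, running the neuron on a prefix of the stream gives the same spikes as the matching prefix of the full run. That property is what allows output to be produced as the signal arrives rather than after the whole utterance has been seen.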
Abstract
Humans naturally perform audiovisual speech recognition (AVSR), enhancing the accuracy and robustness by integrating auditory and visual information. Spiking neural networks (SNNs), which mimic the brain's information-processing mechanisms, are well-suited for emulating the human capability of AVSR. Despite their potential, research on SNNs for AVSR is scarce, with most existing audio-visual multimodal methods focused on object or digit recognition. These models simply integrate features from both modalities, neglecting their unique characteristics and interactions. Additionally, they often rely on future information for current processing, which increases recognition latency and limits real-time applicability. Inspired by human speech perception, this paper proposes a novel human-inspired SNN named HI-AVSNN for AVSR, incorporating three key characteristics: cueing interaction, causal…
Taxonomy
Topics: Speech and Audio Processing
