Audio-visual Multi-channel Recognition of Overlapped Speech
Jianwei Yu, Bo Wu, Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Dong Yu, Xunying Liu, Helen Meng

TL;DR
This paper introduces an audio-visual multi-channel speech recognition system that effectively separates and recognizes overlapped speech by integrating visual cues with advanced audio separation techniques, significantly reducing word error rates.
Contribution
The paper presents a novel multi-channel AVSR system with tightly integrated separation and recognition modules, fine-tuned jointly to improve overlapped speech recognition accuracy.
Findings
Outperforms audio-only ASR, reducing WER by up to 6.81% absolute.
Achieves up to 56.87% relative WER reduction on the LRS2 dataset.
Demonstrates effectiveness of visual cues in multi-channel speech separation.
Abstract
Automatic speech recognition (ASR) of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in state-of-the-art ASR systems. Motivated by the invariance of visual modality to acoustic signal corruption, this paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end. A series of audio-visual multi-channel speech separation front-end components based on TF masking, filter&sum and mask-based MVDR beamforming approaches were developed. To reduce the error cost mismatch between the separation and recognition components, they were jointly fine-tuned using the connectionist temporal classification (CTC) loss function, or a multi-task criterion interpolation with scale-invariant signal to noise…
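To make the multi-task criterion mentioned in the abstract more concrete, here is a minimal sketch of a scale-invariant signal-to-noise measure in its commonly used SI-SNR form, together with a linear interpolation against a CTC loss value. The function names `si_snr` and `multitask_loss`, the weight `alpha`, and the exact form of the interpolation are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals."""
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)
    # Remove the mean so that a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]
    # Project the estimate onto the target to get the target-aligned part.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]
    # Whatever is left after the projection is treated as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(e * e for e in e_noise)
    return 10.0 * math.log10((num + eps) / (den + eps))


def multitask_loss(ctc_loss, si_snr_db, alpha=0.5):
    """Hypothetical interpolation of a recognition and a separation cost.

    alpha is an assumed weight; SI-SNR is negated because higher is better.
    """
    return alpha * ctc_loss + (1.0 - alpha) * (-si_snr_db)


if __name__ == "__main__":
    target = [1.0, -1.0, 1.0, -1.0]
    estimate = [1.1, -0.9, 0.8, -1.2]
    score = si_snr(estimate, target)
    # Rescaling the estimate leaves the score essentially unchanged.
    rescaled = si_snr([3.0 * x for x in estimate], target)
    print(round(score, 3), round(rescaled, 3))
    print(multitask_loss(ctc_loss=2.0, si_snr_db=score, alpha=0.5))
```

Because the estimate is projected onto the target before the ratio is taken, multiplying the estimate by a constant does not change the score, which is what makes the measure suitable for comparing separated waveforms whose gain is arbitrary.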
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Music and Audio Processing
