Audio-visual Multi-channel Recognition of Overlapped Speech

Jianwei Yu; Bo Wu; Rongzhi Gu; Shi-Xiong Zhang; Lianwu Chen; Yong Xu; Meng Yu; Dan Su; Dong Yu; Xunying Liu; Helen Meng

arXiv:2005.08571·eess.AS·November 19, 2020·1 cites

TL;DR

This paper introduces an audio-visual multi-channel speech recognition system that separates and recognizes overlapped speech by combining visual cues with multi-channel separation front-ends (TF masking, filter&sum, and mask-based MVDR beamforming), reducing word error rate by up to 6.81% absolute over audio-only ASR.

Contribution

The paper presents a novel multi-channel AVSR system with tightly integrated separation and recognition modules, fine-tuned jointly to improve overlapped speech recognition accuracy.

Findings

01. Reduces WER by up to 6.81% absolute compared with audio-only ASR.

02. Achieves up to 56.87% relative WER reduction on the LRS2 dataset.

03. Demonstrates the effectiveness of visual cues in multi-channel speech separation.

Abstract

Automatic speech recognition (ASR) of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in state-of-the-art ASR systems. Motivated by the invariance of visual modality to acoustic signal corruption, this paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end. A series of audio-visual multi-channel speech separation front-end components based on TF masking, filter&sum and mask-based MVDR beamforming approaches were developed. To reduce the error cost mismatch between the separation and recognition components, they were jointly fine-tuned using the connectionist temporal classification (CTC) loss function, or a multi-task criterion interpolation with scale-invariant signal to noise…
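The abstract names mask-based MVDR beamforming as one of the separation front-ends but gives no formulas on this page. As a rough illustration only, the NumPy sketch below uses a common reference-channel MVDR formulation for a single frequency bin: masks weight the spatial covariance estimates, and the weights are derived from the resulting speech and noise covariances. The function names and the `ref` and `eps` parameters are assumptions made for this example, not taken from the paper, and the paper's own system (with its audio-visual mask estimation) may differ.

```python
import numpy as np


def masked_covariance(stft, mask):
    """Mask-weighted spatial covariance for one frequency bin.

    stft: complex array of shape (channels, frames).
    mask: real array of shape (frames,), values in [0, 1].
    """
    weighted = stft * mask  # broadcast the mask over channels
    return weighted @ stft.conj().T / max(mask.sum(), 1e-8)


def mvdr_weights(phi_s, phi_n, ref=0, eps=1e-8):
    """Reference-channel MVDR weights from speech and noise covariances.

    `ref` (reference microphone) and `eps` (diagonal loading) are
    illustrative assumptions, not values from the paper.
    """
    channels = phi_n.shape[0]
    phi_n = phi_n + eps * np.eye(channels)      # keep the solve well conditioned
    numerator = np.linalg.solve(phi_n, phi_s)   # phi_n^{-1} phi_s
    return numerator[:, ref] / (np.trace(numerator) + eps)


def beamform(weights, stft):
    """Apply w^H x to every frame; returns an array of shape (frames,)."""
    return weights.conj() @ stft
```

In a typical use, `masked_covariance` would be called once with a speech mask and once with its complement (a noise or interference mask) for each frequency bin, and the two results passed to `mvdr_weights`. With a rank-1 speech covariance, the resulting weights pass the target through as it appears at the reference microphone.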

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Music and Audio Processing