Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Guinan Li; Jiajun Deng; Mengzhe Geng; Zengrui Jin; Tianzi Wang; Shujie Hu; Mingyu Cui; Helen Meng; Xunying Liu

arXiv:2307.02909 · eess.AS · July 7, 2023 · 1 cites

TL;DR

This paper introduces an audio-visual multi-channel system that integrates visual information into speech separation, dereverberation, and recognition, significantly improving accuracy in challenging cocktail party scenarios.

Contribution

It proposes a fully integrated audio-visual approach with end-to-end fine-tuning, demonstrating substantial WER reductions over audio-only methods.

Findings

01

Achieved 9.1% and 6.2% absolute WER reductions over audio-only baselines.

02

Improved speech enhancement metrics, including PESQ, STOI, and SRMR.

03

Validated effectiveness on simulated and replayed Oxford LRS2 data.
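
The findings above are stated as absolute WER reductions. As a minimal sketch of what that means, the Python below computes word error rate with a standard word-level edit distance and then separates an absolute reduction from a relative one. The function names (`word_error_rate`, `absolute_reduction`, `relative_reduction`) are hypothetical and chosen for illustration. This is not the paper's scoring tool, and no numbers here come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, system_wer: float) -> float:
    """Difference in WER, in the same units as the inputs."""
    return baseline_wer - system_wer


def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Absolute reduction expressed as a fraction of the baseline WER."""
    return (baseline_wer - system_wer) / baseline_wer
```

An absolute reduction is a plain difference between two WER values, so the same absolute gain corresponds to a larger relative gain when the baseline WER is lower.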

Abstract

Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is proposed in this paper. The efficacy of the video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and Conformer ASR back-end. Audio-visual integrated front-end architectures performing speech separation and dereverberation in a pipelined or joint fashion via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and ASR back-end components is minimized by…
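
To make the mask-based MVDR separation step named in the abstract more concrete, here is a minimal NumPy sketch of one common formulation: time-frequency masks weight the multi-channel STFT to estimate speech and noise spatial covariance matrices, and the beamformer weights are taken as a column of Φ_N⁻¹Φ_S normalised by its trace. Several things are assumptions for illustration: the function name `mask_based_mvdr`, the real-valued masks passed in as arrays, the reference-channel choice, and the small diagonal loading term. In the paper the masks are predicted by a network that uses the video input, which this sketch does not model.

```python
import numpy as np


def mask_based_mvdr(Y, speech_mask, noise_mask, ref_channel=0, eps=1e-8):
    """Apply a mask-based MVDR beamformer.

    Y:           complex STFT of the array signal, shape (channels, frames, freqs)
    speech_mask: real mask in [0, 1], shape (frames, freqs)
    noise_mask:  real mask in [0, 1], shape (frames, freqs)
    Returns the single-channel beamformed STFT, shape (frames, freqs).
    """
    num_channels = Y.shape[0]
    Yf = np.transpose(Y, (2, 0, 1))  # (freqs, channels, frames)

    def masked_covariance(mask):
        # Mask-weighted average of y y^H over frames, per frequency bin.
        weights = mask.T[:, None, :]  # (freqs, 1, frames)
        outer = (Yf * weights) @ np.conj(np.transpose(Yf, (0, 2, 1)))
        return outer / (mask.sum(axis=0)[:, None, None] + eps)

    phi_s = masked_covariance(speech_mask)
    # Diagonal loading keeps the noise covariance invertible (assumed value).
    phi_n = masked_covariance(noise_mask) + eps * np.eye(num_channels)

    # w(f) = (phi_n^{-1} phi_s) u / trace(phi_n^{-1} phi_s), u selects ref_channel.
    ratio = np.linalg.solve(phi_n, phi_s)  # (freqs, channels, channels)
    trace = np.trace(ratio, axis1=1, axis2=2)[:, None]
    w = ratio[:, :, ref_channel] / (trace + eps)  # (freqs, channels)

    # Beamformer output: w^H y for every frame and frequency bin.
    return np.einsum("fc,fct->tf", np.conj(w), Yf)
```

When the speech covariance is rank one, this beamformer passes the target as observed at the reference channel without distortion while minimising the residual noise power, which is why the quality of the mask estimates matters so much to the separation result.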

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Blind Source Separation Techniques