Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction

Zexu Pan; Gordon Wichern; Yoshiki Masuyama; Francois G. Germain; Sameer Khurana; Chiori Hori; Jonathan Le Roux

arXiv:2310.19644 · eess.AS · October 31, 2023 · 1 citation

Open Access

TL;DR

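This paper introduces AV-GridNet and SAV-GridNet, two target speech extraction models built on TF-GridNet. AV-GridNet conditions the extraction on a face recording of the target speaker; SAV-GridNet additionally identifies the type of interfering scenario and applies an expert model trained for it. The proposed model achieves state-of-the-art results on the second COG-MHEAR Audio-Visual Speech Enhancement Challenge.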

Contribution

It presents a scenario-aware, visual-grounded extension of TF-GridNet that improves speech extraction by identifying interference types and applying specialized models.

Findings

01 Achieved state-of-the-art results on the COG-MHEAR challenge
02 Outperformed existing models in objective and listening tests
03 Provided detailed analysis of scenario-specific performance

Abstract

Target speech extraction aims to extract, based on a given conditioning cue, a target speech signal that is corrupted by interfering sources, such as noise or competing speakers. Building upon the achievements of the state-of-the-art (SOTA) time-frequency speaker separation model TF-GridNet, we propose AV-GridNet, a visual-grounded variant that incorporates the face recording of a target speaker as a conditioning factor during the extraction process. Recognizing the inherent dissimilarities between speech and noise signals as interfering sources, we also propose SAV-GridNet, a scenario-aware model that identifies the type of interfering scenario first and then applies a dedicated expert model trained specifically for that scenario. Our proposed model achieves SOTA results on the second COG-MHEAR Audio-Visual Speech Enhancement Challenge, outperforming other models by a significant…
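The scenario-aware idea described in the abstract (first identify the type of interference, then hand the mixture to an expert trained for that type) can be illustrated with a minimal routing sketch. This is not the authors' implementation: the function names, the scenario labels `"noise"` and `"speech"`, and the toy classifier and experts are all hypothetical stand-ins, chosen only so that the control flow runs without any trained model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Signal = List[float]
# Hypothetical interfaces: both take the audio mixture and a visual cue.
Expert = Callable[[Signal, Sequence[float]], Signal]
Classifier = Callable[[Signal, Sequence[float]], str]


def scenario_aware_extract(
    mixture: Signal,
    visual_cue: Sequence[float],
    classify: Classifier,
    experts: Dict[str, Expert],
) -> Tuple[str, Signal]:
    """Pick a scenario label, then run the expert registered for it."""
    scenario = classify(mixture, visual_cue)
    if scenario not in experts:
        raise KeyError(f"no expert registered for scenario {scenario!r}")
    return scenario, experts[scenario](mixture, visual_cue)


# Toy stand-ins (assumptions, not real models): the classifier thresholds
# the peak amplitude, and each "expert" just rescales the mixture.
def toy_classifier(mixture: Signal, visual_cue: Sequence[float]) -> str:
    return "speech" if max(abs(x) for x in mixture) > 0.5 else "noise"


def toy_noise_expert(mixture: Signal, visual_cue: Sequence[float]) -> Signal:
    return [0.5 * x for x in mixture]


def toy_speech_expert(mixture: Signal, visual_cue: Sequence[float]) -> Signal:
    return [0.25 * x for x in mixture]


EXPERTS: Dict[str, Expert] = {
    "noise": toy_noise_expert,
    "speech": toy_speech_expert,
}

if __name__ == "__main__":
    label, estimate = scenario_aware_extract(
        [0.2, -0.4], [0.0], toy_classifier, EXPERTS
    )
    print(label, estimate)
```

In a real system the classifier and each expert would be trained networks; the dictionary lookup only shows how a predicted scenario label could select which one processes the mixture.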

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis