Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction

Zhaoxi Mu; Xinyu Yang

arXiv:2404.12725 · cs.SD · May 7, 2024

Open Access

TL;DR

This paper introduces AVSepChain, a two-stage framework inspired by the speech chain concept that addresses modality imbalance in audio-visual target speech extraction. Audio is the dominant modality in the speech perception stage, and the roles are reversed in the speech production stage, where the visual modality dominates.

Contribution

It proposes a cross-modal, two-stage speech chain model with a contrastive semantic matching loss to improve audio-visual speech extraction performance.
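
This page does not give the form of the contrastive semantic matching loss. As a rough illustration only, the sketch below uses a symmetric InfoNCE-style objective, a common contrastive loss swapped in here as a stand-in rather than the paper's formulation. The function name `info_nce`, the temperature of 0.07, and the `(batch, dim)` embedding shapes are all assumptions.

```python
import numpy as np

def info_nce(lip_emb, speech_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Hypothetical stand-in, not the paper's loss. Row i of lip_emb is
    assumed to be the positive pair of row i of speech_emb; all other
    rows in the batch act as negatives.
    """
    # L2-normalise so the dot product is a cosine similarity.
    lip = lip_emb / np.linalg.norm(lip_emb, axis=1, keepdims=True)
    speech = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix; the diagonal holds the positives.
    logits = lip @ speech.T / temperature

    def diagonal_cross_entropy(z):
        # Numerically stable log-softmax over each row.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the lip-to-speech and speech-to-lip directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))
```

Under these assumptions, a batch whose lip and speech embeddings are correctly paired should score a lower loss than the same batch with the pairs shuffled, which is the behaviour a semantic matching objective is meant to reward.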

Findings

01 Achieves superior results on multiple benchmark datasets.

02 Effectively balances modality influence in speech extraction.

03 Enhances semantic alignment between lip movements and generated speech.

Abstract

The integration of visual cues has revitalized the performance of the target speech extraction task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm often encounters the challenge of modality imbalance. In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance. To tackle this issue, we propose AVSepChain, drawing inspiration from the speech chain concept. Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production. In the speech perception stage, audio serves as the dominant modality, while visual information acts as the conditional modality. Conversely, in the speech production stage, the roles are reversed. This transformation of modality status aims to alleviate the problem of modality…
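
The role swap between the two stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the AVSepChain architecture: the conditioning uses a FiLM-style scale and shift (a generic technique swapped in for illustration), and the names `condition` and `two_stage`, the weight matrices, and the `(frames, dim)` feature shapes are all hypothetical.

```python
import numpy as np

def condition(dominant, conditional, w_scale, w_shift):
    """Modulate the dominant features with the conditional features.

    FiLM-style stand-in (an assumption, not the paper's mechanism): the
    conditional modality yields a per-channel scale and shift that are
    applied to the dominant modality.
    """
    scale = 1.0 + np.tanh(conditional @ w_scale)
    shift = conditional @ w_shift
    return dominant * scale + shift

def two_stage(audio_feat, visual_feat, w1_scale, w1_shift, w2_scale, w2_shift):
    """Hypothetical two-stage pass with the modality roles swapped."""
    # Speech perception: audio is dominant, visual is the condition.
    perceived = condition(audio_feat, visual_feat, w1_scale, w1_shift)
    # Speech production: visual is dominant, the stage-one output
    # (audio-derived) is the condition.
    produced = condition(visual_feat, perceived, w2_scale, w2_shift)
    return produced
```

In this sketch each modality takes a turn as the signal being transformed instead of always serving as a side input, which is the intuition behind reversing the roles to reduce audio dominance.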

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis