AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling

Vahid Ahmadi Kalkhorani; Cheng Yu; Anurag Kumar; Ke Tan; Buye Xu; DeLiang Wang

arXiv:2406.11619 · eess.AS · June 18, 2024

TL;DR

AV-CrossNet is a novel audiovisual speech separation network that integrates visual cues with complex spectral mapping, significantly improving performance across multiple datasets and challenging conditions.

Contribution

This paper introduces AV-CrossNet, a new audiovisual speech separation model that effectively fuses visual and audio features using an extended CrossNet architecture with attention mechanisms.

Findings

01. Achieves state-of-the-art results on multiple datasets

02. Performs well even on datasets not seen during training and on mismatched datasets

03. Enhances speech separation by leveraging visual cues

Abstract

Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral mapping for speech separation by leveraging global attention and positional encoding. To effectively utilize visual cues, the proposed system incorporates pre-extracted visual embeddings and employs a visual encoder comprising temporal convolutional layers. Audio and visual features are fused in an early fusion layer before being fed to the AV-CrossNet blocks. We evaluate AV-CrossNet on multiple datasets, including LRS, VoxCeleb, and the COG-MHEAR challenge. Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art…
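
The early fusion step mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how the audio and visual streams are time-aligned or combined, so this example assumes nearest-neighbor upsampling of the visual embeddings to the audio frame rate followed by per-frame concatenation. The function names `upsample_nearest` and `early_fusion` and the toy feature sizes are hypothetical and are not taken from the paper or its repository.

```python
from typing import List

Frame = List[float]


def upsample_nearest(visual: List[Frame], num_audio_frames: int) -> List[Frame]:
    """Repeat visual frames so that there is one visual frame per audio frame.

    Assumption: nearest-neighbor (floor) index mapping; the paper may align
    the two streams differently.
    """
    if not visual:
        raise ValueError("visual must contain at least one frame")
    num_visual = len(visual)
    # Audio frame t maps to visual frame floor(t * num_visual / num_audio_frames).
    return [
        visual[min(t * num_visual // num_audio_frames, num_visual - 1)]
        for t in range(num_audio_frames)
    ]


def early_fusion(audio: List[Frame], visual: List[Frame]) -> List[Frame]:
    """Concatenate time-aligned audio and visual features frame by frame.

    Assumption: fusion is plain concatenation along the feature axis.
    """
    aligned = upsample_nearest(visual, len(audio))
    return [a + v for a, v in zip(audio, aligned)]


# Toy input: 4 audio frames with 2 features each, 2 visual frames with 1 feature each.
audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
visual = [[1.0], [2.0]]
fused = early_fusion(audio, visual)
```

In this toy case each visual frame is repeated twice, so every fused frame carries two audio features followed by one visual feature. A real system would apply this to learned embeddings and pass the fused sequence on to the separation blocks.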

Code & Models

Repositories

ahmadikalkhorani/AVCrossNet
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation

Methods: Softmax · Attention Is All You Need