AVFSNet: Audio-Visual Speech Separation for Flexible Number of Speakers with Multi-Scale and Multi-Task Learning

Daning Zhang; Ying Wei

arXiv:2507.12972·eess.AS·July 18, 2025

TL;DR

AVFSNet is an audio-visual speech separation model that handles an unknown number of speakers by combining multi-scale encoding with multi-task learning (joint speaker counting and separation). The authors report state-of-the-art results.

Contribution

The paper introduces AVFSNet, a model that combines multi-scale encoding with a parallel architecture to count and separate speakers without prior knowledge of how many are present in the mixture.

Findings

01. Achieves state-of-the-art separation performance
02. Demonstrates robustness in noisy environments
03. Handles unknown speaker quantities effectively

Abstract

Separating target speech from mixtures containing a variable number of speakers is a challenging task. Existing methods achieve strong separation performance and noise robustness, but most assume that the number of speakers in the mixture is known in advance. The few studies that address an unknown number of speakers generalize poorly to real acoustic environments. To overcome these challenges, this paper proposes AVFSNet -- an audio-visual speech separation model that integrates multi-scale encoding with a parallel architecture -- jointly optimized for speaker counting and multi-speaker separation. The model separates each speaker independently and in parallel, and integrates visual information to improve its adaptability to environmental noise. Comprehensive experimental evaluations demonstrate that AVFSNet achieves…
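The joint optimization described in the abstract can be illustrated with a small multi-task loss. This is a minimal sketch under stated assumptions, not the authors' training objective: the abstract does not name the loss terms, so the separation term here is negative scale-invariant SNR (a common choice in the speech separation literature), the counting term is a cross-entropy over candidate speaker counts, and the names `si_snr`, `count_cross_entropy`, `joint_loss`, and `count_weight` are all hypothetical. Visual features are not modeled.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    n = len(target)
    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Project the estimate onto the target to factor out overall gain.
    dot = sum(e * t for e, t in zip(est, tgt))
    scale = dot / (sum(t * t for t in tgt) + eps)
    proj = [scale * t for t in tgt]
    noise = [e - p for e, p in zip(est, proj)]
    proj_energy = sum(p * p for p in proj)
    noise_energy = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(proj_energy / noise_energy + eps)


def count_cross_entropy(logits, true_count_index):
    """Cross-entropy for speaker counting, using a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[true_count_index]


def joint_loss(estimates, targets, count_logits, true_count_index,
               count_weight=0.5):
    """Assumed multi-task loss: mean negative SI-SNR plus weighted counting loss."""
    separation = -sum(si_snr(e, t) for e, t in zip(estimates, targets)) / len(targets)
    counting = count_cross_entropy(count_logits, true_count_index)
    return separation + count_weight * counting


# Usage: one speaker, three candidate speaker counts (all values are toy data).
target = [0.0, 1.0, 0.0, -1.0]
good_estimate = [0.0, 2.0, 0.0, -2.0]   # scaled copy of the target
bad_estimate = [1.0, 0.0, -1.0, 0.0]    # orthogonal to the target
logits = [2.0, 0.0, 0.0]
print(joint_loss([good_estimate], [target], logits, 0))
print(joint_loss([bad_estimate], [target], logits, 0))
```

The weight `count_weight` is an arbitrary placeholder. In a real multi-task setup it would be tuned so that neither the separation term nor the counting term dominates training.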

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis