AVFSNet: Audio-Visual Speech Separation for Flexible Number of Speakers with Multi-Scale and Multi-Task Learning
Daning Zhang, Ying Wei

TL;DR
AVFSNet is an audio-visual speech separation model that handles an unknown number of speakers by combining multi-scale encoding with multi-task learning; the authors report state-of-the-art results.
Contribution
The paper introduces AVFSNet, a model that combines multi-scale encoding with a parallel architecture to perform speaker counting and separation jointly, without prior knowledge of the number of speakers.
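To make the idea of jointly optimizing counting and separation concrete, here is a minimal sketch of a multi-task objective in plain Python. It is not the paper's loss: the choice of negative SI-SNR for separation, cross-entropy for counting, the `count_weight` value, and all function names are assumptions made for illustration. It also assumes each estimated signal is already paired with its reference speaker.

```python
import math

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (dB) of two equal-length signals."""
    n = len(target)
    # Remove the mean so a constant offset does not affect the score.
    e_mean = sum(estimate) / n
    t_mean = sum(target) / n
    e = [x - e_mean for x in estimate]
    t = [x - t_mean for x in target]
    # Project the estimate onto the target to isolate the "signal" part.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    scale = dot / t_energy
    s_target = [scale * b for b in t]
    noise = [a - s for a, s in zip(e, s_target)]
    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))

def count_cross_entropy(logits, true_count_index):
    """Cross-entropy of the speaker-count logits against the true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_count_index]

def joint_loss(estimates, targets, count_logits, true_count_index,
               count_weight=0.5):
    """Hypothetical joint objective: separation loss plus weighted counting loss."""
    # Separation term: mean negative SI-SNR over the paired speakers.
    separation = -sum(si_snr(e, t) for e, t in zip(estimates, targets)) / len(targets)
    counting = count_cross_entropy(count_logits, true_count_index)
    return separation + count_weight * counting
```

For example, `joint_loss([est_a, est_b], [ref_a, ref_b], count_logits, 1)` would score a two-speaker mixture when index 1 of `count_logits` is taken to mean "two speakers"; that index convention is likewise an assumption of this sketch.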
Findings
Achieves state-of-the-art separation performance
Demonstrates robustness in noisy environments
Handles unknown speaker quantities effectively
Abstract
Separating target speech from mixed signals containing a flexible number of speakers is a challenging task. While existing methods demonstrate strong separation performance and noise robustness, they predominantly assume prior knowledge of the speaker count in each mixture. The limited research that addresses an unknown number of speakers generalizes poorly to real acoustic environments. To overcome these challenges, this paper proposes AVFSNet -- an audio-visual speech separation model integrating multi-scale encoding and a parallel architecture -- jointly optimized for speaker counting and multi-speaker separation tasks. The model independently separates each speaker in parallel while enhancing environmental noise adaptability through visual information integration. Comprehensive experimental evaluations demonstrate that AVFSNet achieves…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
