TL;DR
This paper introduces RAVSS, a multi-speaker audio-visual speech separation framework that separates all speakers concurrently, handles missing visual cues, and achieves state-of-the-art results on the VoxCeleb2 and LRS3 datasets.
Contribution
It presents a simultaneous multi-speaker separation method with speaker-wise interactions, improving robustness to missing visual cues and outperforming existing approaches.
Findings
Achieves state-of-the-art separation accuracy for 2-5 speakers.
Demonstrates robustness to missing or partial visual information.
Outperforms existing methods on the VoxCeleb2 and LRS3 datasets.
Abstract
While existing Audio-Visual Speech Separation (AVSS) methods primarily concentrate on the audio-visual fusion strategy for two-speaker separation, they demonstrate a severe performance drop in the multi-speaker separation scenarios. Typically, AVSS methods employ guiding videos to sequentially isolate individual speakers from the given audio mixture, resulting in notable missing and noisy parts across various segments of the separated speech. In this study, we propose a simultaneous multi-speaker separation framework that can facilitate the concurrent separation of multiple speakers within a singular process. We introduce speaker-wise interactions to establish distinctions and correlations among speakers. Experimental results on the VoxCeleb2 and LRS3 datasets demonstrate that our method achieves state-of-the-art performance in separating mixtures with 2, 3, 4, and 5 speakers,…
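The idea of speaker-wise interactions in a single pass can be illustrated with a toy example. The sketch below is not the authors' architecture: it assumes the interaction is a plain scaled dot-product self-attention across the speaker axis at each time frame, with no learned projections, and the function name `speaker_wise_interaction` and the `feats[speaker][frame]` layout are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def speaker_wise_interaction(feats):
    """Mix per-speaker features across the speaker axis, frame by frame.

    feats[c][t] is a D-dimensional list for speaker c at frame t.
    Assumption: plain dot-product attention without learned weights;
    a real model would add projections, residuals, and normalization.
    """
    num_speakers = len(feats)
    num_frames = len(feats[0])
    dim = len(feats[0][0])
    scale = math.sqrt(dim)
    out = [[None] * num_frames for _ in range(num_speakers)]
    for t in range(num_frames):
        # All speakers' features at this frame are processed together.
        frame = [feats[c][t] for c in range(num_speakers)]
        for i, query in enumerate(frame):
            # Similarity of speaker i to every speaker (including itself).
            scores = [
                sum(q * k for q, k in zip(query, key)) / scale
                for key in frame
            ]
            weights = softmax(scores)
            # Weighted sum over speakers gives the updated feature.
            out[i][t] = [
                sum(w * key[d] for w, key in zip(weights, frame))
                for d in range(dim)
            ]
    return out


# Three speakers, two frames, two feature dimensions (made-up values).
example = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.5, 0.5]],
    [[1.0, 1.0], [0.5, 0.5]],
]
mixed = speaker_wise_interaction(example)
```

Because every speaker's features are updated in the same pass, each output can depend on all the others, which a sequential one-speaker-at-a-time pipeline cannot do.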
