Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech
Junjie Li, Ruijie Tao, Zexu Pan, Meng Ge, Shuai Wang, Haizhou Li

TL;DR
This paper introduces ActiveExtract, an audio-visual model that extracts a target speaker's speech in sparsely overlapped multi-talker scenarios. It outperforms baseline models by over 4 dB in SI-SNR on average.
Contribution
The paper presents a novel audio-visual speaker extraction model that detects speaking activity and disentangles speech in sparsely overlapped conversations, leveraging active speaker detection.
Findings
Outperforms baseline models across various overlap ratios.
Achieves over 4 dB improvement in SI-SNR on average.
Effectively detects target speaker activity and disentangles speech.
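The SI-SNR figure above refers to the scale-invariant signal-to-noise ratio, a standard metric for speech extraction quality. The sketch below shows one common way to compute it with NumPy. The function name `si_snr`, the `eps` stabilizer, and the synthetic test signal are illustrative assumptions, not taken from the paper's evaluation code.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference waveform.

    Hypothetical helper for illustration; `eps` is an assumed stabilizer
    that avoids division by zero and log of zero.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if estimate.shape != target.shape or estimate.ndim != 1:
        raise ValueError("estimate and target must be 1-D arrays of equal length")

    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()

    # Project the estimate onto the target; this part counts as "signal".
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target

    # Whatever is left after the projection counts as "noise".
    e_noise = estimate - s_target

    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return float(10.0 * np.log10(ratio))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000.0
    reference = np.sin(2 * np.pi * 220.0 * t)   # synthetic stand-in for clean speech
    noise = rng.standard_normal(t.shape)
    print("light noise:", si_snr(reference + 0.1 * noise, reference))
    print("heavy noise:", si_snr(reference + 0.5 * noise, reference))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant.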
Abstract
Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture as specified by an auxiliary reference. Most studies focus on the scenario where the target speech is highly overlapped with the interfering speech. However, this scenario only accounts for a small percentage of real-world conversations. In this paper, we aim at the sparsely overlapped scenarios in which the auxiliary reference needs to perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech. We propose an audio-visual speaker extraction model named ActiveExtract, which leverages speaking activity from audio-visual active speaker detection (ASD). The ASD directly provides the frame-level activity of the target speaker, while its intermediate feature representation is trained to discriminate speech-lip…
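To make the "sparsely overlapped" setting and the role of frame-level activity concrete, here is a minimal sketch under stated assumptions. It defines the overlap ratio as the number of frames where both speakers are active divided by the number of frames where anyone is active, which may differ from the paper's definition. The names `overlap_ratio` and `gate_by_activity` are hypothetical, and the hard gating step only illustrates how an activity signal could silence an output. It is not ActiveExtract's actual mechanism.

```python
import numpy as np


def overlap_ratio(target_active, interferer_active):
    """Fraction of speech frames in which both speakers talk at once.

    Assumed definition: overlapped frames / frames with any speech.
    """
    t = np.asarray(target_active, dtype=bool)
    i = np.asarray(interferer_active, dtype=bool)
    if t.shape != i.shape:
        raise ValueError("activity vectors must have the same shape")
    any_speech = np.logical_or(t, i).sum()
    if any_speech == 0:
        return 0.0  # nobody speaks, so nothing overlaps
    return float(np.logical_and(t, i).sum() / any_speech)


def gate_by_activity(frames, target_active):
    """Zero out frames in which the target speaker is inactive.

    `frames` has shape (num_frames, frame_length); `target_active` has
    shape (num_frames,). Illustrative hard gating only.
    """
    frames = np.asarray(frames, dtype=np.float64)
    mask = np.asarray(target_active, dtype=bool)
    if frames.ndim != 2 or mask.shape != (frames.shape[0],):
        raise ValueError("expected frames of shape (T, L) and activity of shape (T,)")
    return frames * mask[:, None]


if __name__ == "__main__":
    target = [1, 1, 0, 0, 0, 1]
    interferer = [0, 1, 1, 1, 0, 0]
    print("overlap ratio:", overlap_ratio(target, interferer))
    print(gate_by_activity(np.ones((6, 2)), target))
```

In a sparsely overlapped conversation this ratio is small, so a useful extractor has to output silence for long stretches where the target is quiet, in addition to separating the overlapped frames.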
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
