Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech

Junjie Li; Ruijie Tao; Zexu Pan; Meng Ge; Shuai Wang; Haizhou Li

arXiv:2309.08408 · cs.SD · September 18, 2023

TL;DR

This paper introduces ActiveExtract, an audio-visual model that extracts a target speaker's speech from sparsely overlapped multi-talker mixtures and outperforms baselines by over 4 dB in SI-SNR on average.

Contribution

The paper presents a novel audio-visual speaker extraction model that detects speaking activity and disentangles speech in sparsely overlapped conversations, leveraging active speaker detection.

Findings

01 Outperforms baseline models across various overlap ratios.

02 Improves SI-SNR over the baselines by more than 4 dB on average.

03 Effectively detects target speaker activity and disentangles the target speech from interfering speech.
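
The findings are reported in SI-SNR (scale-invariant signal-to-noise ratio). As a minimal sketch of how this metric is commonly computed, assuming the standard zero-mean, projection-based definition (the page does not state the exact evaluation code, and the function name `si_snr` and the `eps` value are illustrative choices, not taken from the paper or its repository):

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length waveforms.

    Assumption: the commonly used zero-mean, projection-based definition.
    `eps` is an illustrative stabilizer, not a value from the paper.
    """
    if len(estimate) != len(target) or not target:
        raise ValueError("estimate and target must be non-empty and equal in length")

    # Remove the mean so that a DC offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    est = [x - mean_e for x in estimate]
    tgt = [x - mean_t for x in target]

    # Project the estimate onto the target to get the scaled target component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt)
    scale = dot / (tgt_energy + eps)
    s_target = [scale * t for t in tgt]

    # Everything in the estimate that is not explained by the target is noise.
    e_noise = [e - s for e, s in zip(est, s_target)]

    signal_power = sum(s * s for s in s_target)
    noise_power = sum(n * n for n in e_noise)
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))


# Synthetic example: a sine "target" plus a small deterministic disturbance.
target = [math.sin(0.1 * n) for n in range(1000)]
estimate = [t + 0.1 * math.cos(0.37 * n) for n, t in enumerate(target)]
score_db = si_snr(estimate, target)
```

Because the estimate is projected onto the target before the ratio is taken, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is why the metric suits extraction models whose output level is arbitrary. An improvement figure such as the one quoted above would then be the difference between two such scores.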

Abstract

Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture as specified by an auxiliary reference. Most studies focus on the scenario where the target speech is highly overlapped with the interfering speech. However, this scenario only accounts for a small percentage of real-world conversations. In this paper, we aim at the sparsely overlapped scenarios in which the auxiliary reference needs to perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech. We propose an audio-visual speaker extraction model named ActiveExtract, which leverages speaking activity from audio-visual active speaker detection (ASD). The ASD directly provides the frame-level activity of the target speaker, while its intermediate feature representation is trained to discriminate speech-lip…
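
To make the idea of frame-level speaking activity concrete, the sketch below expands per-video-frame activity decisions to audio-sample resolution and uses them to silence an extracted waveform wherever the target is inactive. This is only an illustration of the concept, not the ActiveExtract architecture: the abstract does not say how the model consumes the ASD output, and the helper names `expand_activity` and `gate_by_activity`, as well as the frame and sample rates mentioned in the comments, are hypothetical.

```python
def expand_activity(frame_activity, samples_per_frame, num_samples):
    """Repeat each frame-level decision so it covers its audio samples.

    Hypothetical helper. For example, 25 fps video with 16 kHz audio
    (assumed rates, not stated on this page) would give 640 samples per frame.
    """
    mask = []
    for active in frame_activity:
        mask.extend([1.0 if active else 0.0] * samples_per_frame)

    # Pad with the last decision if the audio is slightly longer than the video.
    if len(mask) < num_samples:
        pad_value = mask[-1] if mask else 0.0
        mask.extend([pad_value] * (num_samples - len(mask)))

    # Truncate if the video is slightly longer than the audio.
    return mask[:num_samples]


def gate_by_activity(waveform, frame_activity, samples_per_frame):
    """Zero out samples where the target speaker is marked inactive."""
    mask = expand_activity(frame_activity, samples_per_frame, len(waveform))
    return [x * m for x, m in zip(waveform, mask)]


# Toy example: 10 audio samples, 3 video frames, 3 samples per frame.
gated = gate_by_activity([1.0] * 10, [1, 0, 1], samples_per_frame=3)
```

In the toy example the middle frame is inactive, so samples 3 to 5 are set to zero and the rest pass through. In sparsely overlapped speech this kind of activity information matters because the target is silent for long stretches, and an extractor without it may output interfering speech during those stretches.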

Code & Models

Repositories

mrjunjieli/activeextract (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing