VCSE: Time-Domain Visual-Contextual Speaker Extraction Network

Junjie Li; Meng Ge; Zexu Pan; Longbiao Wang; Jianwu Dang

arXiv:2210.06177 · cs.CV · October 13, 2022

TL;DR

This paper introduces VCSE, a two-stage time-domain network that incorporates visual and self-enrolled contextual cues stage by stage for speaker extraction in multi-talker scenarios, and reports better results than state-of-the-art baselines on the LRS3 dataset.

Contribution

The paper presents a two-stage framework that incorporates visual and contextual cues stage by stage for speaker extraction, whereas previous studies introduced the two modalities in a single model.

Findings

01 VCSE outperforms state-of-the-art baselines on the LRS3 dataset.

02 The two-stage approach effectively leverages visual and contextual cues.

03 Speech extraction accuracy improves significantly.

Abstract

Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such reference can be auditory, i.e., a pre-recorded speech, visual, i.e., lip movements, or contextual, i.e., phonetic sequence. References in different modalities provide distinct and complementary information that could be fused to form top-down attention on the target speaker. Previous studies have introduced visual and contextual modalities in a single model. In this paper, we propose a two-stage time-domain visual-contextual speaker extraction network named VCSE, which incorporates visual and self-enrolled contextual cues stage by stage to take full advantage of every modality. In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence. In the second stage, we refine the pre-extracted target speech with the…
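The two-stage data flow described in the abstract can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: the names `StageOneOutput`, `run_two_stage`, `toy_stage_one`, and `toy_stage_two` are hypothetical, the toy stages are placeholders rather than neural networks, and the abstract is truncated before it fully specifies the inputs to the second stage, so the sketch assumes only the pre-extracted speech and the estimated phonetic sequence are passed on.

```python
from dataclasses import dataclass
from typing import Callable, List

Waveform = List[float]          # time-domain samples
VisualCues = List[List[float]]  # one feature vector per video frame (assumed layout)


@dataclass
class StageOneOutput:
    pre_extracted: Waveform       # coarse estimate of the target speech
    phonetic_sequence: List[int]  # estimated phoneme ids (self-enrolled contextual cue)


def run_two_stage(
    mixture: Waveform,
    visual_cues: VisualCues,
    stage_one: Callable[[Waveform, VisualCues], StageOneOutput],
    stage_two: Callable[[Waveform, List[int]], Waveform],
) -> Waveform:
    """Chain the two stages: visual cues first, contextual cues second."""
    first = stage_one(mixture, visual_cues)
    # Assumption: stage two sees only the stage-one outputs.
    return stage_two(first.pre_extracted, first.phonetic_sequence)


def toy_stage_one(mixture: Waveform, visual_cues: VisualCues) -> StageOneOutput:
    # Placeholder: scale the mixture and emit one dummy phoneme id per video frame.
    pre_extracted = [0.5 * sample for sample in mixture]
    phonetic_sequence = [0 for _ in visual_cues]
    return StageOneOutput(pre_extracted, phonetic_sequence)


def toy_stage_two(pre_extracted: Waveform, phonetic_sequence: List[int]) -> Waveform:
    # Placeholder: returns the waveform unchanged; a real second stage would
    # condition its refinement on phonetic_sequence.
    return list(pre_extracted)


refined = run_two_stage(
    mixture=[1.0, -2.0, 4.0],
    visual_cues=[[0.1], [0.2]],
    stage_one=toy_stage_one,
    stage_two=toy_stage_two,
)
```

Passing the stages in as callables keeps the skeleton independent of any particular model; real stages would replace the toy functions while the chaining logic stays the same.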

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing