VCSE: Time-Domain Visual-Contextual Speaker Extraction Network
Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang

TL;DR
This paper introduces VCSE, a two-stage time-domain neural network that fuses visual and contextual cues for improved speaker extraction in multi-talker scenarios, demonstrating superior performance on the LRS3 dataset.
Contribution
The paper presents a two-stage framework that integrates visual and contextual information sequentially, stage by stage, rather than introducing both modalities in a single model as previous studies did.
Findings
VCSE outperforms state-of-the-art baselines on the LRS3 dataset.
The two-stage approach uses visual cues in the first stage and self-enrolled contextual cues in the second.
Significant improvement in speech extraction accuracy.
Abstract
Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such reference can be auditory, i.e., a pre-recorded speech, visual, i.e., lip movements, or contextual, i.e., phonetic sequence. References in different modalities provide distinct and complementary information that could be fused to form top-down attention on the target speaker. Previous studies have introduced visual and contextual modalities in a single model. In this paper, we propose a two-stage time-domain visual-contextual speaker extraction network named VCSE, which incorporates visual and self-enrolled contextual cues stage by stage to take full advantage of every modality. In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence. In the second stage, we refine the pre-extracted target speech with the…
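The stage-by-stage data flow described in the abstract can be sketched roughly as follows. This is only a toy illustration in NumPy, not the authors' network: the function names (`stage_one`, `stage_two`, `two_stage_extract`), the sigmoid "visual gate", the energy-based phonetic posterior, and the phone inventory size of 40 are all hypothetical placeholders standing in for learned modules. It also assumes the second stage consumes only the pre-extracted speech and the estimated phonetic posteriors, which the truncated abstract does not confirm.

```python
import numpy as np


def stage_one(mixture, lip_features, n_phones=40):
    """Toy stage 1: pre-extract speech with visual cues and estimate phonetic posteriors.

    All operations here are placeholders for learned modules (assumption).
    """
    n_frames = lip_features.shape[0]
    hop = len(mixture) // n_frames  # audio samples per video frame
    # Hypothetical visual gate: one value in (0, 1) per video frame.
    gate = 1.0 / (1.0 + np.exp(-lip_features.mean(axis=1)))
    # Upsample the gate to the audio rate and mask the mixture.
    pre_extracted = mixture[: n_frames * hop] * np.repeat(gate, hop)
    # Placeholder phonetic posteriors: a softmax over energy-scaled logits.
    frames = pre_extracted.reshape(n_frames, hop)
    energy = (frames ** 2).mean(axis=1, keepdims=True)
    logits = energy * np.arange(n_phones)[None, :]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    posteriors = np.exp(logits)
    posteriors = posteriors / posteriors.sum(axis=1, keepdims=True)
    return pre_extracted, posteriors


def stage_two(pre_extracted, posteriors):
    """Toy stage 2: refine the pre-extracted speech with the self-enrolled contextual cue."""
    n_frames = posteriors.shape[0]
    hop = len(pre_extracted) // n_frames
    # Placeholder refinement: scale each frame by its most likely phone's posterior.
    confidence = posteriors.max(axis=1)
    return pre_extracted * np.repeat(confidence, hop)


def two_stage_extract(mixture, lip_features):
    """Run both stages in sequence; the contextual cue comes from stage 1, not from outside."""
    pre_extracted, posteriors = stage_one(mixture, lip_features)
    refined = stage_two(pre_extracted, posteriors)
    return pre_extracted, posteriors, refined
```

The point of the sketch is the wiring: the contextual reference is "self-enrolled" in the sense that it is estimated inside the pipeline during the first stage, so no externally supplied transcript or phonetic sequence is needed at inference time.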
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
