USEV: Universal Speaker Extraction with Visual Cue
Zexu Pan, Meng Ge, Haizhou Li

TL;DR
This paper introduces USEV, a universal speaker extraction model that uses lip movement as a visual cue. It handles diverse multi-talker scenarios, including any degree of speaker overlap and an absent target speaker, with higher signal fidelity than baseline methods.
Contribution
It proposes a novel universal speaker extraction network using visual cues and a scenario-aware loss function for balanced performance across scenarios.
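To make the idea of a scenario-aware loss concrete, here is a minimal sketch in plain Python. The summary above does not specify the loss terms, so everything here is an assumption for illustration: it uses negative SI-SDR when the target speaker is present and an output-power penalty when the target is absent. The function names `si_sdr`, `output_power_db`, and `scenario_aware_loss` are hypothetical and not taken from the paper.

```python
import math

EPS = 1e-8  # small constant to avoid log(0) and division by zero


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference) + EPS
    scale = dot / ref_energy
    # Project the estimate onto the reference, then measure the residual.
    target = [scale * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + EPS) / (noise_energy + EPS))


def output_power_db(estimate):
    """Mean output power in dB; lower means the output is closer to silence."""
    power = sum(e * e for e in estimate) / len(estimate)
    return 10.0 * math.log10(power + EPS)


def scenario_aware_loss(estimate, reference, target_present):
    """Assumed scenario switch: fidelity loss if the target is present,
    silence-encouraging power loss if the target is absent."""
    if target_present:
        return -si_sdr(estimate, reference)
    return output_power_db(estimate)


# Usage: a close estimate scores a lower loss than an unrelated one,
# and a silent output scores lower than a loud one when the target is absent.
reference = [0.0, 1.0, 0.0, -1.0]
close_estimate = [0.0, 0.9, 0.1, -1.0]
unrelated_estimate = [1.0, 0.0, -1.0, 0.0]
print(scenario_aware_loss(close_estimate, reference, True))
print(scenario_aware_loss(unrelated_estimate, reference, True))
print(scenario_aware_loss([0.0] * 4, reference, False))
```

The point of switching the objective by scenario is that SI-SDR is not meaningful when the reference is silent, so a separate term is needed for the target-absent case. How the paper actually weights or segments these scenarios is not stated in this summary.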
Findings
Outperforms baseline methods in signal fidelity.
Remains effective when speaker overlap varies and when the target speaker is absent.
Utilizes visual lip movement cues for robust speaker extraction.
Abstract
A speaker extraction algorithm seeks to extract the target speaker's speech from a multi-talker speech mixture. Prior studies focus mostly on speaker extraction from highly overlapped multi-talker speech mixtures. However, in natural speech communication the target-interference speaker overlapping ratio can vary over a wide range from 0% to 100%. Furthermore, the target speaker may be absent from the speech mixture. The speech mixtures in such universal multi-talker scenarios are described as general speech mixtures. A speaker extraction algorithm requires an auxiliary reference, such as a video recording or pre-recorded speech, to form top-down auditory attention on the target speaker. We advocate that a visual cue, i.e., lip movement, is more informative than an audio cue, i.e., pre-recorded speech, to serve as the auxiliary reference for speaker extraction in…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
