WASE: Learning When to Attend for Speaker Extraction in Cocktail Party Environments
Yunzhe Hao, Jiaming Xu, Peng Zhang, Bo Xu

TL;DR
This paper introduces a novel speaker extraction method that explicitly models sound onset and offset cues, improving performance by combining these cues with voiceprint information, and achieving near state-of-the-art results with fewer parameters.
Contribution
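According to the authors, this is the first work to explicitly model sound onset/offset cues for speaker extraction. Drawing on auditory scene analysis, the onset/offset-based model performs speaker extraction together with speaker-dependent voice activity detection, and the cues can be combined with a voiceprint.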
Findings
Performance close to state-of-the-art with fewer parameters
Effective integration of onset/offset cues with voiceprint
Improved speaker extraction accuracy
Abstract
In the speaker extraction problem, it is found that additional information from the target speaker contributes to the tracking and extraction of the target speaker, which includes voiceprint, lip movement, facial expression, and spatial information. However, no one cares for the cue of sound onset, which has been emphasized in the auditory scene analysis and psychology. Inspired by it, we explicitly modeled the onset cue and verified the effectiveness in the speaker extraction task. We further extended to the onset/offset cues and got performance improvement. From the perspective of tasks, our onset/offset-based model completes the composite task, a complementary combination of speaker extraction and speaker-dependent voice activity detection. We also combined voiceprint with onset/offset cues. Voiceprint models voice characteristics of the target while onset/offset models the start/end…
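To make the onset/offset idea concrete, here is a minimal sketch of what these cues mean at the frame level. It is not the paper's model. It assumes a hypothetical per-frame 0/1 activity mask for the target speaker (the kind of output a speaker-dependent voice activity detector would produce) and reads onsets and offsets off its transitions. The function names `onset_offset_from_activity` and `gate_frames` are illustrative only.

```python
from typing import List, Tuple


def onset_offset_from_activity(activity: List[int]) -> Tuple[List[int], List[int]]:
    """Return frame indices where the target starts and stops speaking.

    `activity` is a hypothetical per-frame 0/1 mask (1 = target speaker active).
    An onset is a 0 -> 1 transition; an offset is a 1 -> 0 transition.
    The offset index is the first inactive frame after a run of speech.
    """
    onsets: List[int] = []
    offsets: List[int] = []
    prev = 0  # assumption: the frame before the clip is treated as silence
    for i, cur in enumerate(activity):
        if cur and not prev:
            onsets.append(i)
        elif prev and not cur:
            offsets.append(i)
        prev = cur
    if prev:
        # Speech runs to the end of the clip, so close the last segment there.
        offsets.append(len(activity))
    return onsets, offsets


def gate_frames(frames: List[float], activity: List[int]) -> List[float]:
    """Zero out frames where the target is inactive (a crude 'when to attend' gate)."""
    return [x if a else 0.0 for x, a in zip(frames, activity)]


if __name__ == "__main__":
    mask = [0, 1, 1, 0, 0, 1]
    print(onset_offset_from_activity(mask))
    print(gate_frames([0.5, 0.2, -0.1], [0, 1, 1]))
```

In this toy view, the onset/offset cue tells a system when the target is present, while a voiceprint would tell it whose voice to look for. The gate above is a hard mask chosen for illustration, not a description of the paper's approach.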
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
