TL;DR
This paper investigates co-speech gestures as a cue for speaker extraction and shows that gestures can help isolate a target speaker's speech from multi-talker audio, even when only low-resolution video is available.
Contribution
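It introduces two neural networks that use co-speech gestures as the speaker cue: one fuses the gesture cue implicitly during extraction, and the other separates the speech first and then applies the gesture cue explicitly. This extends the set of cue modalities beyond the face images and pre-recorded speech samples used in earlier work.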
Findings
Co-speech gestures improve speaker association accuracy.
Gesture-based models outperform baseline methods without gesture cues.
Gestures are effective even with low-resolution video recordings.
Abstract
Speaker extraction seeks to extract the clean speech of a target speaker from a multi-talker speech mixture. Previous studies have used a pre-recorded speech sample or a face image of the target speaker as the speaker cue. In human communication, co-speech gestures that are naturally timed with speech also contribute to speech perception. In this work, we explore the use of co-speech gesture sequences, e.g. hand and body movements, as the speaker cue for speaker extraction. Such sequences can be easily obtained from low-resolution video recordings and are thus more readily available than face recordings. We propose two networks that use the co-speech gesture cue to perform attentive listening on the target speaker: one implicitly fuses the co-speech gesture cue into the speaker extraction process, while the other performs speech separation first, followed by explicitly using the co-speech gesture cue to…
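The separate-then-associate idea can be illustrated with a toy sketch. This is not the paper's network: it assumes the separated streams are already available, and it swaps the learned gesture cue for a simple hand-crafted rule that picks the stream whose energy envelope correlates best with the amount of body motion. The function names (`frame_energy`, `motion_energy`, `associate`) and the keypoint layout `(frames, joints, 2)` are hypothetical choices made for illustration.

```python
import numpy as np


def frame_energy(signal, n_frames):
    """Mean energy of a 1-D signal in n_frames equal-length chunks."""
    usable = len(signal) - len(signal) % n_frames  # drop trailing samples
    frames = signal[:usable].reshape(n_frames, -1)
    return (frames ** 2).mean(axis=1)


def motion_energy(keypoints):
    """Total joint displacement between consecutive video frames.

    keypoints: array of shape (T, J, 2) with 2-D joint positions (assumed layout).
    Returns an array of shape (T - 1,).
    """
    velocity = np.diff(keypoints, axis=0)  # (T - 1, J, 2)
    return np.linalg.norm(velocity, axis=-1).sum(axis=1)


def associate(separated, keypoints):
    """Pick the separated stream that best matches the gesture motion.

    separated: list of 1-D arrays, one per separated speaker.
    Returns (index of the best stream, list of correlation scores).
    """
    motion = motion_energy(keypoints)
    scores = []
    for stream in separated:
        envelope = frame_energy(stream, len(motion))
        # Pearson correlation between speech energy and motion energy.
        scores.append(float(np.corrcoef(envelope, motion)[0, 1]))
    return int(np.argmax(scores)), scores
```

Calling `associate(streams, keypoints)` returns the index of the stream judged to belong to the gesturing speaker, together with the per-stream correlation scores. A learned model would replace the correlation rule; the sketch only shows where the gesture cue enters when separation comes first.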
