TL;DR
This paper presents a novel hearable system that isolates target speech in noisy environments using a brief, noisy enrollment sample obtained by looking at the speaker, enabling effective speech extraction without clean examples.
Contribution
Introduces a new enrollment interface that captures a short, noisy example of the target speech while the wearer looks at the speaker, enabling robust speech extraction in real-world noisy settings without clean enrollment data.
Findings
Achieves 7.01 dB signal quality improvement with less than 5 seconds of noisy enrollment
Processes 8 ms audio chunks in 6.24 ms on an embedded CPU
Generalizes well to real-world static and mobile speakers in diverse environments
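The latency finding above can be checked with simple arithmetic. The sketch below is illustrative and not taken from the paper: the helper names `real_time_factor` and `headroom_ms` are hypothetical, and only the 8 ms and 6.24 ms figures come from the findings. It shows how much of each chunk's time budget the reported processing time would use.

```python
def real_time_factor(chunk_ms: float, processing_ms: float) -> float:
    """Ratio of processing time to audio duration; below 1.0 keeps up with real time."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return processing_ms / chunk_ms


def headroom_ms(chunk_ms: float, processing_ms: float) -> float:
    """Time left over per chunk after processing finishes."""
    return chunk_ms - processing_ms


# Figures reported in the findings: 8 ms chunks processed in 6.24 ms.
rtf = real_time_factor(8.0, 6.24)
slack = headroom_ms(8.0, 6.24)
print(f"RTF = {rtf:.2f}, headroom = {slack:.2f} ms per chunk")
```

A ratio below 1.0 means each chunk is processed before the next one arrives, which is the condition for streaming operation on the device.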
Abstract
In crowded settings, the human brain can focus on speech from a target speaker, given prior knowledge of how they sound. We introduce a novel intelligent hearable system that achieves this capability, enabling the wearer to hear the target speaker while ignoring all interfering speech and noise. A naive approach is to require a clean speech example to enroll the target speaker. This is, however, not well aligned with the hearable application domain, since obtaining a clean example is challenging in real-world scenarios, creating a unique user interface problem. We present the first enrollment interface where the wearer looks at the target speaker for a few seconds to capture a single, short, highly noisy, binaural example of the target speaker. This noisy example is used for enrollment and subsequent speech extraction in the presence of interfering speakers and noise. Our system…
