Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect
Jun Wang, Max W. Y. Lam, Dan Su, Dong Yu

TL;DR
Tune-In introduces a novel attention network that mimics the human cocktail party effect, enabling robust speech separation and speaker verification under interference while using less memory and computation.
Contribution
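The paper proposes a new attention network that learns two separate spaces, speaker-knowledge and speech-stimuli, on top of a shared feature space, and passes information between them through a cross- and dual-attention mechanism to improve speech separation and speaker verification under interference.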
Findings
Learns discriminative speaker representations in interference conditions
Achieves superior speech separation performance (SI-SNRi, SDRi)
Uses less memory and computation than state-of-the-art methods
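For readers unfamiliar with the SI-SNRi metric named above, the following is a minimal sketch of the standard scale-invariant signal-to-noise ratio and its improvement over the unprocessed mixture. It is not taken from the paper's code; the function names and the `eps` stabiliser are assumptions made for illustration.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals (standard definition)."""
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to isolate the target component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Whatever is left after the projection is treated as noise.
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)

def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the separated estimate over the raw mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

A higher SI-SNRi means the separated signal is closer to the clean target than the mixture was. Because the estimate is projected onto the target, rescaling the estimate does not change the score.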
Abstract
We study the cocktail party problem and propose a novel attention network called Tune-In, short for training under negative environments with interference. It first learns two separate spaces, speaker-knowledge and speech-stimuli, based on a shared feature space, where a new block structure is designed as the building block for all spaces, and then cooperatively solves different tasks. Between the two spaces, information is cast towards each other via a novel cross- and dual-attention mechanism, mimicking the bottom-up and top-down processes of the human cocktail party effect. It turns out that substantially discriminative and generalizable speaker representations can be learnt in severely interfered conditions via our self-supervised training. The experimental results verify this seeming paradox. The learnt speaker embedding has superior discriminative power than a standard…
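To make the idea of casting information between two spaces concrete, here is a minimal sketch of generic scaled dot-product cross-attention, in which frames from one space query vectors from the other. This is the textbook form, not the paper's cross- and dual-attention mechanism, whose details are not given in this summary; the array shapes and the names `speech_stimuli` and `speaker_knowledge` are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Generic scaled dot-product attention: each query row attends over all keys."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values, weights

# Hypothetical shapes: 50 speech frames and 4 speaker vectors, both 16-dimensional.
rng = np.random.default_rng(0)
speech_stimuli = rng.standard_normal((50, 16))
speaker_knowledge = rng.standard_normal((4, 16))

# Each speech frame gathers a weighted mix of the speaker vectors.
attended, weights = cross_attention(speech_stimuli, speaker_knowledge, speaker_knowledge)
```

Swapping the roles of the two arrays gives attention in the opposite direction, which is one simple way to picture information flowing both ways between two learned spaces.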
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
