Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect

Jun Wang; Max W. Y. Lam; Dan Su; Dong Yu

arXiv:2103.01461·eess.AS·March 3, 2021

TL;DR

Tune-In introduces a novel attention network that mimics the human cocktail party effect, enabling robust speaker separation and verification under interference while consuming fewer resources.

Contribution

The paper proposes a new attention network with a dual-space structure and cross-attention mechanisms for improved speech separation and speaker verification in noisy environments.

Findings

01 Learns discriminative speaker representations in interference conditions

02 Achieves superior speech separation performance (SI-SNRi, SDRi)

03 Uses less memory and computation than state-of-the-art methods
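The separation finding above is reported in SI-SNRi, the improvement in scale-invariant signal-to-noise ratio of the separated estimate over the unprocessed mixture. As a minimal sketch of how that metric is commonly computed (this is the standard definition, not code from the paper; the function names and the toy sine signals are assumptions made for illustration):

```python
import math

def _zero_mean(x):
    # Remove the DC offset, as the standard SI-SNR definition requires.
    mu = sum(x) / len(x)
    return [v - mu for v in x]

def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a clean reference."""
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    # Project the estimate onto the reference to get the target component.
    scale = _dot(est, ref) / (_dot(ref, ref) + eps)
    target = [scale * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    return 10.0 * math.log10((_dot(target, target) + eps) / (_dot(noise, noise) + eps))

def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)

# Hypothetical toy signals: a target tone, an interfering tone, and their mixture.
n = 100
reference = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
interferer = [math.sin(2 * math.pi * 13 * t / n) for t in range(n)]
mixture = [r + i for r, i in zip(reference, interferer)]
# Pretend a separator suppressed the interferer to 10% of its amplitude.
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

improvement_db = si_snr_improvement(estimate, mixture, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant. SDRi follows the same "estimate minus mixture" pattern with a different underlying ratio and is not sketched here.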

Abstract

We study the cocktail party problem and propose a novel attention network called Tune-In, short for training under negative environments with interference. It first learns two separate spaces, one of speaker knowledge and one of speech stimuli, on top of a shared feature space, with a new block structure serving as the building block for all spaces, and then solves different tasks cooperatively. Between the two spaces, information is cast towards each other via a novel cross- and dual-attention mechanism, mimicking the bottom-up and top-down processes of the human cocktail party effect. It turns out that substantially discriminative and generalizable speaker representations can be learnt in severely interfered conditions via our self-supervised training. The experimental results verify this seeming paradox. The learnt speaker embedding has discriminative power superior to that of a standard…
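The abstract describes information flowing in both directions between a speaker space and a speech space through attention. The paper's own cross- and dual-attention block is not specified on this page, so the following is only a sketch of generic scaled dot-product cross-attention applied in both directions; the function names, the toy feature vectors, and their dimensions are all hypothetical.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Generic scaled dot-product attention: each query attends over all keys.

    This is the textbook mechanism, NOT the Tune-In block itself.
    """
    d = len(keys[0])
    width = len(values[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)])
    return out

# Hypothetical toy features: 2 speaker-space vectors and 3 speech-space frames.
speaker = [[1.0, 0.0], [0.0, 1.0]]
speech = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

# One direction: speaker vectors query the speech frames.
speaker_informed = cross_attention(speaker, speech, speech)
# Other direction: speech frames query the speaker vectors.
speech_informed = cross_attention(speech, speaker, speaker)
```

Running the same function with the roles of query and key/value swapped is one simple way to picture two spaces informing each other; how the paper maps these directions onto its bottom-up and top-down processes is not stated in this summary.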

Videos

Tune-In: Training under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect · underline

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing