MIRNet: Learning multiple identities representations in overlapped speech
Hyewon Han, Soo-Whan Chung, Hong-Goo Kang

TL;DR
This paper introduces a deep learning method to extract multiple speaker identities from overlapped speech signals, enabling improved speaker verification and speech separation without needing reference features.
Contribution
It proposes a deep speaker representation network that extracts multiple speaker identities directly from overlapped speech and is trained with speaker identity labels only, whereas conventional approaches need reference acoustic features for training.
Findings
Effective in speaker verification tasks
Improves speech separation conditioned on speaker embeddings
Requires only speaker identity labels for training
Abstract
Many approaches can derive information about a single speaker's identity from speech by learning to recognize consistent characteristics of acoustic parameters. However, it is challenging to determine identity information when there are multiple concurrent speakers in a given signal. In this paper, we propose a novel deep speaker representation strategy that can reliably extract multiple speaker identities from overlapped speech. We design a network that extracts a high-level embedding containing information about each speaker's identity from a given mixture. Unlike conventional approaches that need reference acoustic features for training, our proposed algorithm requires only the speaker identity labels of the overlapped speech segments. We demonstrate the effectiveness and usefulness of our algorithm in a speaker verification task and a speech separation system…
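To make the verification use case concrete, the sketch below scores an enrolled speaker against several identity embeddings extracted from one mixture. It is an illustration only: the abstract does not describe the paper's scoring rule, so the max-cosine decision, the function names `cosine_similarity` and `verify_against_mixture`, and the threshold value are all assumptions, and the embeddings are toy vectors rather than network outputs.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify_against_mixture(
    enrolled: Sequence[float],
    mixture_embeddings: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> Tuple[bool, int, float]:
    """Compare an enrolled embedding with every embedding extracted from a mixture.

    Assumed decision rule: accept if the best-matching mixture embedding
    reaches the threshold. Returns (accepted, best_index, best_score).
    """
    scores = [cosine_similarity(enrolled, e) for e in mixture_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    best_score = scores[best_index]
    return best_score >= threshold, best_index, best_score


# Toy example: two embeddings standing in for a two-speaker mixture.
enrolled_speaker = [1.0, 0.0, 0.0]
mixture = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
accepted, index, score = verify_against_mixture(enrolled_speaker, mixture)
```

Taking the maximum over the extracted embeddings is one simple way to avoid deciding in advance which output slot belongs to which speaker; a real system might instead calibrate scores or use a learned back end.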
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
