Neural Predictive Coding using Convolutional Neural Networks towards Unsupervised Learning of Speaker Characteristics
Arindam Jati, Panayiotis Georgiou

TL;DR
This paper introduces Neural Predictive Coding (NPC), an unsupervised learning framework that uses convolutional neural networks to extract speaker-specific features from unlabeled audio, even when that audio contains non-speech events and multi-speaker streams.
Contribution
The paper proposes a novel unsupervised method, NPC, leveraging a short-term active-speaker stationarity hypothesis and siamese networks to learn speaker embeddings from unlabeled data.
Findings
NPC embeddings perform best in the short-duration speaker identification experiment.
NPC provides complementary information to i-vectors in full-utterance scenarios.
In large-scale verification, NPC compares favorably with supervised methods.
Abstract
Learning speaker-specific features is vital in many applications such as speaker recognition, diarization, and speech recognition. This paper provides a novel approach, which we term Neural Predictive Coding (NPC), to learn speaker-specific characteristics in a completely unsupervised manner from large amounts of unlabeled training data that even contain many non-speech events and multi-speaker audio streams. The NPC framework exploits the proposed short-term active-speaker stationarity hypothesis, which assumes that two temporally close short speech segments belong to the same speaker, and thus that a common representation encoding the commonalities of both segments should capture the vocal characteristics of that speaker. We train a convolutional deep siamese network to produce "speaker embeddings" by learning to separate 'same' vs 'different' speaker pairs which are generated from an…
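The pair-generation idea behind the stationarity hypothesis can be illustrated with a minimal sketch. The abstract excerpt is truncated before it says how pairs are drawn, so everything specific here is an assumption made for illustration: the function name `sample_pairs`, the parameters `seg_len` and `max_gap`, the 50/50 label split, and in particular the choice to draw 'different' pairs from two distinct streams. Streams are plain Python lists of frames, and the labels are only assumed, since no speaker annotations are used.

```python
import random

def sample_pairs(streams, seg_len, max_gap, n_pairs, seed=0):
    """Sample (segment_a, segment_b, label) tuples from unlabeled streams.

    label 1: segments are temporally close in the same stream, so they are
             ASSUMED to share a speaker (stationarity hypothesis).
    label 0: segments come from two distinct streams, so they are ASSUMED
             to come from different speakers (illustrative assumption).
    Requires at least two streams, each with at least
    2 * seg_len + max_gap frames.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        i = rng.randrange(len(streams))
        s = streams[i]
        # Leave room after the anchor for a gap and a second segment.
        start = rng.randrange(len(s) - 2 * seg_len - max_gap + 1)
        seg_a = s[start:start + seg_len]
        if rng.random() < 0.5:
            # 'Same' pair: second segment follows within max_gap frames.
            b_start = start + seg_len + rng.randint(0, max_gap)
            seg_b = s[b_start:b_start + seg_len]
            label = 1
        else:
            # 'Different' pair: second segment from another stream.
            j = rng.choice([k for k in range(len(streams)) if k != i])
            t = streams[j]
            b_start = rng.randrange(len(t) - seg_len + 1)
            seg_b = t[b_start:b_start + seg_len]
            label = 0
        pairs.append((seg_a, seg_b, label))
    return pairs
```

Pairs sampled this way would be fed to the two branches of a siamese network, which is trained to predict the label. Label noise is expected under this scheme (a speaker change can fall inside the gap, and two streams can share a speaker), which is why the labels are described as assumed rather than true.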
