ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers

Kaizhi Qian; Yang Zhang; Heting Gao; Junrui Ni; Cheng-I Lai; David Cox; Mark Hasegawa-Johnson; Shiyu Chang

arXiv:2204.09224 · cs.SD · June 27, 2022 · 24 cites

TL;DR

This paper introduces ContentVec, a self-supervised speech representation method that effectively disentangles speaker information from speech content, improving downstream task performance without losing content quality.

Contribution

The paper presents a novel self-supervised learning (SSL) approach based on HuBERT that achieves speaker disentanglement through regularization, addressing a key challenge in speech representation learning.

Findings

01 Speaker-disentangled representations outperform the baseline in downstream tasks

02 The proposed method maintains content integrity while removing speaker information

03 Consistent performance improvements across multiple evaluations

Abstract

Self-supervised learning (SSL) in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks. Since the majority of the downstream tasks of SSL in speech focus on the content information in speech, the most desirable speech representations should be able to disentangle unwanted variations, such as speaker variations, from the content. However, disentangling speakers is very challenging, because removing the speaker information could easily result in a loss of content as well, and the damage from the latter usually far outweighs the benefit of the former. In this paper, we propose a new SSL method that can achieve speaker disentanglement without severe loss of content. Our approach is adapted from the HuBERT framework, and incorporates disentangling mechanisms to…
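To make the idea of speaker information leaking into a content representation concrete, here is a minimal synthetic sketch. It is not ContentVec's method and uses no real speech: the 2-D features, the fixed content centers and speaker offsets, and the helper names (`make_features`, `centroid_probe_accuracy`, `remove_speaker_mean`) are all assumptions made up for illustration. Per-speaker mean subtraction stands in as a deliberately simple disentangling step, and a nearest-centroid probe measures how well speaker identity and content class can be read from the features.

```python
import random

# Hypothetical toy setup: fixed content centers and speaker offsets in 2-D.
CONTENT_CENTERS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
SPEAKER_OFFSETS = [(3.0, 3.0), (-3.0, 3.0), (3.0, -3.0), (-3.0, -3.0)]


def make_features(per_pair=20, noise=0.1, seed=0):
    """Return rows of (feature, speaker_id, content_id).

    Each feature is content center + speaker offset + Gaussian noise.
    """
    rng = random.Random(seed)
    rows = []
    for s, offset in enumerate(SPEAKER_OFFSETS):
        for c, center in enumerate(CONTENT_CENTERS):
            for _ in range(per_pair):
                x = tuple(center[d] + offset[d] + rng.gauss(0.0, noise)
                          for d in range(2))
                rows.append((x, s, c))
    return rows


def centroid_probe_accuracy(rows, label_index):
    """Nearest-centroid probe: fit on even-indexed rows, test on odd ones.

    label_index is 1 to probe speaker identity, 2 to probe content class.
    """
    train, test = rows[0::2], rows[1::2]
    sums, counts = {}, {}
    for row in train:
        x, y = row[0], row[label_index]
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in sums[y]] for y in sums}

    def predict(x):
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2
                                     for a, b in zip(x, centroids[y])))

    correct = sum(predict(row[0]) == row[label_index] for row in test)
    return correct / len(test)


def remove_speaker_mean(rows):
    """Subtract each speaker's mean feature (a stand-in disentangling step)."""
    sums, counts = {}, {}
    for x, s, _ in rows:
        acc = sums.setdefault(s, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[s] = counts.get(s, 0) + 1
    means = {s: [v / counts[s] for v in sums[s]] for s in sums}
    return [(tuple(v - m for v, m in zip(x, means[s])), s, c)
            for x, s, c in rows]


if __name__ == "__main__":
    raw = make_features()
    cleaned = remove_speaker_mean(raw)
    for name, rows in (("raw", raw), ("speaker-mean removed", cleaned)):
        print(name,
              "speaker probe:", centroid_probe_accuracy(rows, 1),
              "content probe:", centroid_probe_accuracy(rows, 2))
```

In this toy setup the speaker probe should fall toward chance after the speaker means are removed, while the content probe improves. That outcome depends on every speaker producing the same balanced mix of content classes, so removing a speaker's mean takes nothing content-specific with it. Real speech offers no such guarantee, which is the difficulty the abstract describes: removing speaker information can easily remove content as well.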

Code & Models

Repositories

auspicious3000/contentvec
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling