Autoregressive Co-Training for Learning Discrete Speech Representations

Sung-Lin Yeh; Hao Tang

arXiv:2203.15840 · cs.CL · November 1, 2022

TL;DR

This paper introduces a generative model with discrete latent variables for speech, trained with information-theoretic co-training. The learned discrete representation is more correlated with phonetic units than representations from HuBERT-like training or vector quantization.

Contribution

It formulates the learning of discrete speech representations as information-theoretic co-training, an objective that subsumes existing approaches such as HuBERT-like training and vector quantization.

Findings

01. The learned representations are highly correlated with phonetic units.

02. The approach outperforms HuBERT-like training and vector quantization in correlation with phonetic units.

03. The framework is flexible: its objective can be optimized with several approaches.

Abstract

While several self-supervised approaches for learning discrete speech representations have been proposed, it is unclear how these seemingly similar approaches relate to each other. In this paper, we consider a generative model with discrete latent variables that learns a discrete representation for speech. The objective for learning the generative model is formulated as information-theoretic co-training. Besides being widely general, the objective can be optimized with several approaches, subsuming HuBERT-like training and vector quantization for learning discrete representations. Empirically, we find that the proposed approach learns a discrete representation that is highly correlated with phonetic units, more so than HuBERT-like training and vector quantization.
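The abstract does not say how correlation with phonetic units is measured. As a rough illustration only, one common way to quantify how well frame-level discrete codes line up with phone labels is normalized mutual information. The sketch below assumes two aligned, equal-length sequences (one code index and one phone label per frame). The function name `normalized_mutual_information` and the toy data are hypothetical and are not taken from the paper or its repository.

```python
import math
from collections import Counter


def normalized_mutual_information(codes, phones):
    """Mutual information I(code; phone) divided by the phone entropy H(phone).

    Assumption: `codes` and `phones` are frame-aligned sequences of equal length.
    Returns a value in [0, 1]; 1 means the codes fully determine the phone labels.
    """
    if len(codes) != len(phones):
        raise ValueError("codes and phones must have the same length")
    n = len(codes)
    if n == 0:
        raise ValueError("sequences must be non-empty")

    joint = Counter(zip(codes, phones))   # co-occurrence counts
    code_counts = Counter(codes)          # marginal counts of codes
    phone_counts = Counter(phones)        # marginal counts of phones

    # I(C; P) = sum_{c,p} p(c,p) * log( p(c,p) / (p(c) p(p)) )
    mi = 0.0
    for (c, p), n_cp in joint.items():
        mi += (n_cp / n) * math.log(n_cp * n / (code_counts[c] * phone_counts[p]))

    # H(P) = -sum_p p(p) log p(p)
    h_phone = -sum((k / n) * math.log(k / n) for k in phone_counts.values())
    return mi / h_phone if h_phone > 0 else 0.0


# Toy example (made-up data): codes that track phones vs. codes that do not.
aligned = normalized_mutual_information([0, 0, 1, 1], ["a", "a", "b", "b"])
unrelated = normalized_mutual_information([0, 1, 0, 1], ["a", "a", "b", "b"])
```

In the toy example, `aligned` comes out close to 1 because each code maps to exactly one phone, while `unrelated` comes out close to 0 because the codes carry no information about the phones. Normalizing by the phone entropy is one design choice among several; other normalizations or purity-style measures would serve the same illustrative purpose.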

Code & Models

Repositories

30stomercury/autoregressive-co-training (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing