Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021

Takashi Maekaku; Xuankai Chang; Yuya Fujita; Li-Wei Chen; Shinji Watanabe; Alexander Rudnicky

arXiv:2107.05899·cs.SD·February 17, 2022

TL;DR

This paper introduces a speech representation learning system that combines Conformer CPC with deep cluster. It reports top results in the ZeroSpeech 2021 challenge, with gains over a CPC-small baseline on phonetic, lexical, and syntactic metrics.

Contribution

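It combines Conformer CPC with deep cluster: k-means pseudo-labels from CPC outputs are used to train an additional autoregressive classifier, whose final-layer outputs are clustered again to obtain phoneme-discriminative representations for zero-resource tasks.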

Findings

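01 35% relative improvement in the phonetic metric over the CPC-small baseline

02 1.5% relative improvement in the lexical metric over the CPC-small baseline

03 2.3% relative improvement in the syntactic metric over the CPC-small baseline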

Abstract

We present a system for the Zero Resource Speech Challenge 2021 that combines Contrastive Predictive Coding (CPC) with deep cluster. In deep cluster, we first prepare pseudo-labels by clustering the outputs of a CPC network with k-means. Then, we train an additional autoregressive model to classify the previously obtained pseudo-labels in a supervised manner. A phoneme-discriminative representation is achieved by executing a second round of clustering on the outputs of the final layer of the autoregressive model. We show that replacing a Transformer layer with a Conformer layer leads to a further gain in a lexical metric. Experimental results show relative improvements of 35% in a phonetic metric, 1.5% in the lexical metric, and 2.3% in a syntactic metric compared to a baseline method of CPC-small, which is trained on LibriSpeech 460h data. We achieve top…
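
The first deep cluster step in the abstract, turning frame-level features into k-means pseudo-labels, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the toy `frames` list stands in for CPC outputs, and the function names, number of clusters, and iteration count are all assumptions chosen for readability.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(features, centroids):
    """Return the index of the nearest centroid for every feature vector."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(f, centroids[k]))
        for f in features
    ]


def kmeans(features, num_clusters, num_iters=20, seed=0):
    """Plain k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise centroids from randomly chosen feature vectors.
    centroids = [list(c) for c in rng.sample(features, num_clusters)]
    for _ in range(num_iters):
        labels = assign(features, centroids)
        for k in range(num_clusters):
            members = [f for f, lab in zip(features, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(features, centroids)


# Hypothetical stand-in for CPC outputs: ten 2-D "frames" in two groups.
frames = [[0.1 * i, 0.0] for i in range(5)] + [[10.0 + 0.1 * i, 10.0] for i in range(5)]

# Round 1: cluster the features to obtain one pseudo-label per frame.
centroids, pseudo_labels = kmeans(frames, num_clusters=2)
print(pseudo_labels)
```

In the pipeline the abstract describes, these pseudo-labels would then serve as classification targets for the additional autoregressive model, and the same clustering routine would be applied a second time to that model's final-layer outputs. The classifier itself is omitted here.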
