Learning Phone Recognition from Unpaired Audio and Phone Sequences Based on Generative Adversarial Network

Da-rong Liu, Po-chun Hsu, Yi-chen Chen, Sung-feng Huang, Shun-po Chuang, Da-yi Wu, and Hung-yi Lee

arXiv:2207.14568·cs.SD·August 1, 2022

TL;DR

This paper presents a novel two-stage GAN-based framework for learning phone recognition directly from unpaired speech and phone sequences, reducing reliance on large paired datasets.

Contribution

It introduces an iterative approach that combines GANs and HMMs to improve phone recognition from unpaired data, outperforming acoustic unit discovery methods on TIMIT.

Findings

01. Outperforms acoustic unit discovery methods on the TIMIT dataset

02. Effectively learns from unpaired speech and phone sequences

03. Enhances segmentation accuracy for phone recognition

Abstract

Automatic speech recognition (ASR) has recently been shown to achieve great performance. However, most ASR systems rely on massive paired data, which is not feasible for low-resource languages worldwide. This paper investigates how to learn directly from unpaired phone sequences and speech utterances. We design a two-stage iterative framework. In the first stage, GAN training is adopted to find the mapping between unpaired speech and phone sequences. In the second stage, an HMM is trained on the generator's output, which boosts performance and provides a better segmentation for the next iteration. In the experiments, we first investigate different choices of model design. We then compare the framework to different types of baselines: (i) supervised methods, (ii) acoustic unit discovery based methods, and (iii) methods learning from unpaired data. Our framework performs consistently…
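The control flow of the two-stage iteration described in the abstract can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: the function names, the fixed-length initial segmentation, and the toy stand-ins for the GAN and HMM stages are all assumptions made for this example. It shows only how a segmentation feeds the first stage, how the first stage's output feeds the second, and how the second stage returns a new segmentation for the next round.

```python
from typing import Callable, List

Boundaries = List[int]   # start frame of each segment
FrameLabels = List[str]  # one pseudo phone label per frame


def uniform_boundaries(num_frames: int, seg_len: int) -> Boundaries:
    """Hypothetical initial segmentation: fixed-length segments (an assumption)."""
    return list(range(0, num_frames, seg_len))


def boundaries_from_labels(frame_labels: FrameLabels) -> Boundaries:
    """Return the start index of every run of identical frame labels."""
    return [i for i, label in enumerate(frame_labels)
            if i == 0 or label != frame_labels[i - 1]]


def iterate_framework(
    num_frames: int,
    gan_stage: Callable[[Boundaries, int], FrameLabels],
    hmm_stage: Callable[[FrameLabels], Boundaries],
    num_iterations: int,
    seg_len: int = 4,
) -> List[Boundaries]:
    """Alternate the two stages, recording the segmentation after each round."""
    boundaries = uniform_boundaries(num_frames, seg_len)
    history = [boundaries]
    for _ in range(num_iterations):
        # Stage 1 (placeholder): map segmented speech to pseudo phone labels.
        frame_labels = gan_stage(boundaries, num_frames)
        # Stage 2 (placeholder): fit on those labels and re-segment.
        boundaries = hmm_stage(frame_labels)
        history.append(boundaries)
    return history


def toy_gan_stage(boundaries: Boundaries, num_frames: int) -> FrameLabels:
    """Toy stand-in for GAN training: label segments alternately 'aa' and 'b'."""
    ends = boundaries[1:] + [num_frames]
    labels: FrameLabels = []
    for k, (start, end) in enumerate(zip(boundaries, ends)):
        labels.extend(["aa" if k % 2 == 0 else "b"] * (end - start))
    return labels


def toy_hmm_stage(frame_labels: FrameLabels) -> Boundaries:
    """Toy stand-in for HMM training and alignment: read boundaries off labels."""
    return boundaries_from_labels(frame_labels)


history = iterate_framework(8, toy_gan_stage, toy_hmm_stage, num_iterations=2)
```

In a real system the two callables would wrap actual GAN and HMM training. The stand-ins here are deterministic only so that the loop can be run and inspected.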

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing