From Semi-supervised to Almost-unsupervised Speech Recognition with Very-low Resource by Jointly Learning Phonetic Structures from Audio and Text Embeddings

Yi-Chen Chen; Sung-Feng Huang; Hung-yi Lee; Lin-shan Lee

arXiv:1904.05078 · cs.CL · April 11, 2019 · 1 citation

TL;DR

This paper explores a semi-supervised approach to low-resource speech recognition by jointly learning phonetic structures from audio and text embeddings, inspired by how infants learn language from limited examples.

Contribution

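It combines Audio Word2Vec, which learns phonetic structures from spoken words, with a second autoencoder that learns phonetic structures from text words, and learns the relationship between the two embeddings so that speech recognition becomes possible with very little annotated data.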

Findings

01. Achieved a 44.6% word error rate (WER) on TIMIT with only 2.1 hours of speech data

02. Reduced WER to 34.2% with 4.1 hours of speech data

03. Demonstrates the potential of joint phonetic structure learning for low-resource ASR
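
For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of the standard word error rate definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this page does not say which scoring tool the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a WER of 44.6% means roughly 45 word-level errors for every 100 reference words.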

Abstract

Producing a large amount of annotated speech data for training ASR systems remains difficult for more than 95% of languages all over the world which are low-resourced. However, we note human babies start to learn the language by the sounds (or phonetic structures) of a small number of exemplar words, and "generalize" such knowledge to other words without hearing a large amount of data. We initiate some preliminary work in this direction. Audio Word2Vec is used to learn the phonetic structures from spoken words (signal segments), while another autoencoder is used to learn the phonetic structures from text words. The relationships among the above two can be learned jointly, or separately after the above two are well trained. This relationship can be used in speech recognition with very low resource. In the initial experiments on the TIMIT dataset, only 2.1 hours of speech data (in which…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
