Towards Unsupervised Speech Recognition Without Pronunciation Models

Junrui Ni; Liming Wang; Yang Zhang; Kaizhi Qian; Heting Gao; Mark Hasegawa-Johnson; Chang D. Yoo

arXiv:2406.08380 · cs.CL · January 10, 2025

TL;DR

This paper introduces an unsupervised speech recognition method that relies on neither a pronunciation model nor paired speech-text data. It learns through joint speech-to-speech and text-to-text masked token-infilling and reaches a word error rate of 20-23% on a curated English corpus.

Contribution

It proposes a new approach for word-level unsupervised ASR that removes the need for phoneme lexicons and demonstrates its effectiveness on English speech data.

Findings

01. Achieves 20-23% word error rate without parallel transcripts

02. Outperforms previous lexicon-free unsupervised ASR models

03. Successfully refines word segmentation iteratively

Abstract

Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing to remove the reliance on a phoneme lexicon. We explore a new research direction, word-level unsupervised ASR, and experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. Using a curated speech corpus containing a fixed number of English words, our system iteratively refines the word segmentation structure and achieves a word error rate of 20-23%, depending on the vocabulary size, without…
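
For readers unfamiliar with the metric quoted above, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's evaluation script; the function name word_error_rate and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a WER of 20-23% means roughly one word-level edit for every four to five reference words. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.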

Code & Models

Repositories

jeromeni/wholeword-uasr-jstti
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing