Unsupervised Automatic Speech Recognition: A Review
Hanan Aldarmaki, Asad Ullah, Nazar Zaki

TL;DR
This review explores the potential and challenges of developing fully unsupervised automatic speech recognition systems, focusing on models that learn from speech data without extensive labeled datasets, especially for low-resource languages.
Contribution
It synthesizes existing research on unsupervised and semi-supervised ASR methods, highlighting current limitations and minimum data requirements for effective speech recognition.
Findings
Unsupervised segmentation of speech signals is feasible but challenging.
Unsupervised mapping from speech segments to text remains a key hurdle.
Understanding the minimum data requirements can help direct resources and effort in ASR development for low-resource languages.
Abstract
Automatic Speech Recognition (ASR) systems can be trained to achieve remarkable performance given large amounts of manually transcribed speech, but large labeled data sets can be difficult or expensive to acquire for all languages of interest. In this paper, we review the research literature to identify models and ideas that could lead to fully unsupervised ASR, including unsupervised segmentation of the speech signal, unsupervised mapping from speech segments to text, and semi-supervised models with nominal amounts of labeled examples. The objective of the study is to identify the limitations of what can be learned from speech data alone and to understand the minimum requirements for speech recognition. Identifying these limitations would help optimize the resources and efforts in ASR development for low-resource languages.
