# Accurate semi-supervised automatic speech recognition for ordinary and characterized speeches via multi-hypotheses-based curriculum learning

**Authors:** Ka Hyun Park, Junghun Kim, U Kang

PMC · DOI: 10.1371/journal.pone.0333915 · PLOS One · 2025-10-21

## TL;DR

This paper introduces MOCA and MOCA-S, semi-supervised ASR frameworks that improve transcription accuracy for both ordinary and characterized speech by generating multiple hypotheses per speech instance and training with curriculum learning.

## Contribution

The novel contribution is MOCA (for ordinary speech) and MOCA-S (for characterized speech), which use multi-hypotheses-based curriculum learning to reduce reliance on potentially inaccurate pseudo labels in semi-supervised ASR.

## Key findings

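- Experiments on real-world speech datasets show that MOCA and MOCA-S significantly improve the accuracy of previous ASR models.
- MOCA-S supplements the limited trait-specific data for characterized speech with speech from other traits, adjusting the number of pseudo labels according to relevance to the target trait.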
- The framework reduces sensitivity to pseudo-label quality in semi-supervised settings.

## Abstract

How can we build accurate transcription models for both ordinary speech and characterized speech in a semi-supervised setting? ASR (Automatic Speech Recognition) systems are widely used in various real-world applications, including translation systems and transcription services. ASR models are tailored to serve one of two types of speeches: 1) ordinary speech (e.g., speeches from the general population) and 2) characterized speech (e.g., speeches from speakers with special traits, such as certain nationalities or speech disorders). Recently, the limited availability of labeled speech data and the high cost of manual labeling have drawn significant attention to the development of semi-supervised ASR systems. Previous semi-supervised ASR models employ a pseudo-labeling scheme to incorporate unlabeled examples during training. However, these methods rely heavily on pseudo labels during training and are therefore highly sensitive to the quality of pseudo labels. The issue of low-quality pseudo labels is particularly pronounced for characterized speech, due to the limited availability of data specific to a certain trait. This scarcity hinders the initial ASR model’s ability to effectively capture the unique characteristics of characterized speech, resulting in inaccurate pseudo labels. In this paper, we propose a framework for training accurate ASR models for both ordinary and characterized speeches in a semi-supervised setting. Specifically, we propose MOCA (Multi-hypotheses-based Curriculum learning for semi-supervised Asr) for ordinary speech and MOCA-S for characterized speech. MOCA and MOCA-S generate multiple hypotheses for each speech instance to reduce the heavy reliance on potentially inaccurate pseudo labels. Moreover, MOCA-S for characterized speech effectively supplements the limited trait-specific speech data by exploiting speeches of the other traits. Specifically, MOCA-S adjusts the number of pseudo labels based on the relevance to the target trait. Extensive experiments on real-world speech datasets show that MOCA and MOCA-S significantly improve the accuracy of previous ASR models.
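
To make the multi-hypothesis idea in the abstract concrete, the sketch below shows one way several pseudo labels per speech instance could be kept and weighted instead of trusting a single top-1 transcript. It is an illustration under stated assumptions, not the authors' implementation: the function names `weight_hypotheses` and `hypothesis_budget`, the softmax weighting over log-probabilities, and the linear mapping from relevance to the number of hypotheses are all hypothetical. In particular, the abstract does not say how relevance is measured or whether more relevant speech receives more or fewer pseudo labels; this sketch assumes more.

```python
import math


def weight_hypotheses(nbest, k):
    """Keep the k best hypotheses and normalize their scores into weights.

    nbest: list of (transcript, log_probability) pairs, e.g. from beam search.
    Returns a list of (transcript, weight) pairs whose weights sum to 1.
    ASSUMPTION: softmax over log-probabilities; the paper may weight differently.
    """
    if k < 1 or not nbest:
        raise ValueError("need k >= 1 and at least one hypothesis")
    # Stable sort, highest log-probability first, then truncate to k entries.
    top = sorted(nbest, key=lambda h: h[1], reverse=True)[:k]
    # Subtract the maximum before exponentiating for numerical stability.
    max_lp = top[0][1]
    exps = [math.exp(lp - max_lp) for _, lp in top]
    total = sum(exps)
    return [(text, e / total) for (text, _), e in zip(top, exps)]


def hypothesis_budget(relevance, k_max):
    """Map a relevance score in [0, 1] to a hypothesis count in 1..k_max.

    ASSUMPTION: a linear mapping where higher relevance to the target trait
    yields more pseudo labels; the actual rule in MOCA-S is not given here.
    """
    relevance = min(1.0, max(0.0, relevance))
    return max(1, math.ceil(relevance * k_max))


if __name__ == "__main__":
    # Hypothetical 3-best list for one unlabeled utterance.
    nbest = [
        ("turn the light on", math.log(0.5)),
        ("turn the lights on", math.log(0.25)),
        ("turn delight on", math.log(0.25)),
    ]
    k = hypothesis_budget(relevance=0.5, k_max=4)
    for text, weight in weight_hypotheses(nbest, k):
        print(f"{weight:.3f}  {text}")
```

In a training loop, each weight could scale the loss computed against its transcript, so that a single wrong pseudo label contributes only part of the gradient for that instance.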

## Full-text entities

- **Diseases:** speech disorders (MESH:D013064)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12539715/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12539715/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/PMC12539715/full.md

---
Source: https://tomesphere.com/paper/PMC12539715