# A study on phonemes recognition method for Mandarin pronunciation based on improved Zipformer-RNN-T(Pruned) modeling

**Authors:** Zhaohui Du, Xiaofeng Zhao, Lin Li, Baohua Yu, Lijiang Miao, Jie Zhang

PMC · DOI: 10.1371/journal.pone.0324048 · PLOS One · 2025-05-23

## TL;DR

This paper introduces an improved Zipformer-RNN-T(Pruned) model for Mandarin phoneme recognition that reaches a WER of 1.92% (dev) and 2.12% (test) on AISHELL1-PHONEME with 61.1M parameters.

## Contribution

Proposes a pruned Zipformer-RNN-T model with GELU and hybrid loss for efficient Mandarin phoneme recognition.
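
The two ingredients named here can be illustrated with a minimal sketch in plain Python. It assumes the exact (erf-based) form of GELU and a simple convex combination of the two losses; the `ctc_weight` value and the `hybrid_loss` helper are hypothetical and are not taken from the paper, whose actual weighting scheme is not given in this summary.

```python
import math


def relu(x: float) -> float:
    """ReLU: output (and gradient) is exactly zero for negative inputs."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x: float) -> float:
    """Derivative of GELU: Phi(x) + x * phi(x), with phi the normal PDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


def hybrid_loss(rnnt_loss: float, ctc_loss: float, ctc_weight: float = 0.2) -> float:
    """Hypothetical fusion: weighted sum of a transducer loss and a CTC loss.

    ctc_weight = 0.2 is an illustrative value, not the paper's setting.
    """
    return (1.0 - ctc_weight) * rnnt_loss + ctc_weight * ctc_loss


if __name__ == "__main__":
    # For a negative input, ReLU is flat while GELU still passes a small signal.
    print(relu(-1.0), gelu(-1.0), gelu_grad(-1.0))
    print(hybrid_loss(2.0, 4.0))
```

Because GELU keeps a non-zero gradient for negative pre-activations, a unit that drifts negative can still be updated, which is the usual argument behind the "neuron deactivation" claim.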

## Key findings

- Achieved 1.92% WER on AISHELL1-PHONEME dev set and 2.12% on test set.
- Model uses only 61.1M parameters, balancing performance and efficiency.
- Hybrid Pruned RNN-T/CTC loss fusion further improved recognition performance.
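
For readers unfamiliar with the metric, the error rates above are conventionally computed as total edit distance divided by total reference length. The sketch below applies that standard definition to phoneme tokens; the function names and the pinyin-style example tokens are illustrative assumptions, not the paper's scoring script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps):
    """Corpus-level error rate: summed edits over summed reference length."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_tokens = sum(len(r) for r in refs)
    return total_edits / total_tokens


if __name__ == "__main__":
    # Hypothetical phoneme tokens (initials and toned finals).
    refs = [["n", "i3", "h", "ao3"]]
    hyps = [["n", "i2", "h", "ao3"]]
    print(error_rate(refs, hyps))  # one substitution over four tokens: 0.25
```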

## Abstract

In recent years, empowered by artificial intelligence technologies, computer-assisted language learning systems have gradually become a hot topic of research. Currently, the mainstream pronunciation assessment models rely on advanced speech recognition technology, converting speech into phoneme sequences and then determining mispronounced phonemes through sequence comparison. To optimize the phoneme recognition task in pronunciation evaluation, this paper proposes a Chinese pronunciation phoneme recognition model based on the improved Zipformer-RNN-T(Pruned) architecture, aiming to improve recognition accuracy and reduce parameter count. First, the AISHELL1-PHONEME and ST-CMDS-PHONEME datasets for Mandarin phoneme recognition are constructed through data preprocessing. Then, three layers of the Zipformer Block architecture are introduced into the Zipformer encoder to significantly enhance model performance. In the stateless Pred Network, the GELU activation function is adopted to effectively prevent neuron deactivation. Furthermore, a hybrid Pruned RNN-T/CTC Loss fusion strategy is proposed, further optimizing recognition performance. The experimental results demonstrate that the method performs excellently in the phoneme recognition task, achieving a Word Error Rate (WER) of 1.92% (Dev) and 2.12% (Test) on the AISHELL1-PHONEME dataset, and 4.28% (Dev) and 4.51% (Test) on the ST-CMDS-PHONEME dataset. Moreover, the model requires only 61.1M parameters, striking a balance between performance and efficiency.
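
The "sequence comparison" step described in the abstract can be sketched as follows, assuming the recognizer has already produced a phoneme sequence. This is an illustration built on Python's standard `difflib`, not the authors' assessment pipeline; `find_mispronunciations` and the example phoneme tokens are hypothetical. Note that `SequenceMatcher` aligns on longest matching blocks and does not guarantee a minimum-edit alignment.

```python
from difflib import SequenceMatcher


def find_mispronunciations(expected, recognized):
    """Return (operation, expected_span, recognized_span) for each mismatch.

    expected:   canonical phoneme sequence for the prompt text
    recognized: phoneme sequence output by the recognizer
    """
    matcher = SequenceMatcher(a=expected, b=recognized, autojunk=False)
    issues = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete', or 'insert'
            issues.append((tag, expected[i1:i2], recognized[j1:j2]))
    return issues


if __name__ == "__main__":
    # Hypothetical example: the retroflex initial "zh" is realized as "z".
    expected = ["zh", "ong1", "g", "uo2"]
    recognized = ["z", "ong1", "g", "uo2"]
    for issue in find_mispronunciations(expected, recognized):
        print(issue)  # ('replace', ['zh'], ['z'])
```

The quality of this comparison depends directly on the recognizer's phoneme accuracy, which is why the paper focuses on lowering the recognition error rate.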

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12101685/full.md

## Figures

13 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12101685/full.md

## References

24 references — full list in the complete paper: https://tomesphere.com/paper/PMC12101685/full.md

---
Source: https://tomesphere.com/paper/PMC12101685